hpcaitech / CachedEmbeddingLinks

A memory efficient DLRM training solution using ColossalAI

☆105

Alternatives and similar repositories for CachedEmbedding

Users that are interested in CachedEmbedding are comparing it to the libraries listed below

Sorting:

hpcaitech / TensorNVMe
A Python library transfers PyTorch tensors between CPU and NVMe
☆116Updated 7 months ago
Harry-Chen / InfMoE
Inference framework for MoE layers based on TensorRT with Python binding
☆41Updated 4 years ago
anyscale / llm-continuous-batching-benchmarks
☆120Updated last year
deepspeedai / DeepSpeed-Kernels
☆74Updated 3 months ago
Michaelvll / llm-ie-benchmarks
A collection of reproducible inference engine benchmarks
☆32Updated 2 months ago
feifeibear / Odysseus-Transformer
Odysseus: Playground of LLM Sequence Parallelism
☆70Updated last year
Ascend / AscendSpeed
☆79Updated last year
hpcaitech / Titans
A collection of models built with ColossalAI
☆32Updated 2 years ago
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆80Updated 10 months ago
mlc-ai / llm-perf-bench
☆120Updated last year
MayDomine / Burst-Attention
Distributed IO-aware Attention algorithm
☆20Updated 10 months ago
hpcaitech / Elixir
Elixir: Train a Large Language Model on a Small GPU Cluster
☆15Updated 2 years ago
ModelTC / awesome-lm-system
Summary of system papers/frameworks/codes/tools on training or serving large model
☆57Updated last year
stanford-futuredata / stk
☆106Updated 10 months ago
yale-sys / prompt-cache
Modular and structured prompt caching for low-latency LLM inference
☆97Updated 8 months ago
inferflow / inferflow
Inferflow is an efficient and highly configurable inference engine for large language models (LLMs).
☆243Updated last year
fpgaminer / GPTQ-triton
GPTQ inference Triton kernel
☆302Updated 2 years ago
Infini-AI-Lab / MagicDec
[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
☆121Updated 7 months ago
neuralmagic / AutoFP8
☆195Updated 2 months ago
casper-hansen / AutoAWQ_kernels
☆75Updated 7 months ago
rayleizhu / vllm-ra
[ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts
☆40Updated last year
vllm-project / vllm-nccl
Manages vllm-nccl dependency
☆17Updated last year
project-etalon / etalon
LLM Serving Performance Evaluation Harness
☆79Updated 4 months ago
OpenNLPLab / LASP
Linear Attention Sequence Parallelism (LASP)
☆85Updated last year
vllm-project / flash-attention
Fast and memory-efficient exact attention
☆80Updated last week
triton-inference-server / hugectr_backend
☆55Updated last year
pytorch / torchsnapshot
A performant, memory-efficient checkpointing library for PyTorch applications, designed with large, complex distributed workloads in mind…
☆158Updated 3 weeks ago
PipeFusion / PipeFusion
A Suite for Parallel Inference of Diffusion Transformers (DiTs) on multi-GPU Clusters
☆47Updated 11 months ago
volcengine / veGiantModel
☆220Updated last year
qhliu26 / Dive-into-Big-Model-Training
📑 Dive into Big Model Training
☆114Updated 2 years ago