hpcaitech / Elixir

Elixir: Train a Large Language Model on a Small GPU Cluster

☆15

Alternatives and similar repositories for Elixir

Users who are interested in Elixir are comparing it to the libraries listed below.

Infini-AI-Lab / MagicDec
[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
☆131 Updated 11 months ago
ModelTC / awesome-lm-system
Summary of system papers/frameworks/code/tools for training or serving large models
☆57 Updated last year
deepspeedai / DeepSpeed-Kernels
☆71 Updated 7 months ago
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆85 Updated last year
opengear-project / GEAR
GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
☆170 Updated last year
stanford-futuredata / stk
☆113 Updated last year
anyscale / llm-continuous-batching-benchmarks
☆122 Updated last year
microsoft / chunk-attention
☆81 Updated 7 months ago
tanyuqian / redco
NAACL '24 (Best Demo Paper Runner-Up) / MLSys @ NeurIPS '23 - RedCoast: A Lightweight Tool to Automate Distributed Training and Inference
☆68 Updated 11 months ago
feifeibear / Odysseus-Transformer
Odysseus: Playground of LLM Sequence Parallelism
☆78 Updated last year
vedantroy / gpu_kernels
☆27 Updated last year
project-etalon / etalon
LLM Serving Performance Evaluation Harness
☆80 Updated 8 months ago
hpcaitech / TensorNVMe
A Python library that transfers PyTorch tensors between CPU and NVMe
☆121 Updated 11 months ago
thunlp / Ouroboros
Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main)
☆112 Updated 8 months ago
dame-cell / Triformer
Transformers components but in Triton
☆34 Updated 6 months ago
PipeFusion / PipeFusion
A Suite for Parallel Inference of Diffusion Transformers (DiTs) on multi-GPU Clusters
☆52 Updated last year
rayleizhu / vllm-ra
[ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts
☆40 Updated last year
zhengzangw / Sequence-Scheduling
PyTorch implementation of paper "Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline".
☆93 Updated 2 years ago
SqueezeBits / QUICK
QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference
☆118 Updated last year
thu-pacman / FasterMoE
☆88 Updated 3 years ago
MDK8888 / vllmini
A minimal implementation of vLLM.
☆60 Updated last year
kssteven418 / BigLittleDecoder
[NeurIPS'23] Speculative Decoding with Big Little Decoder
☆95 Updated last year
RulinShao / LightSeq
Official repository for DistFlashAttn: Distributed Memory-efficient Attention for Long-context LLMs Training
☆217 Updated last year
LeiWang1999 / AutoGPTQ.tvm
GPTQ inference TVM kernel
☆39 Updated last year
Jingyu6 / speculative_prefill
☆46 Updated 6 months ago
Equationliu / Kangaroo
[NeurIPS 2024] The official implementation of "Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exitin…
☆63 Updated last year
sail-sg / VocabularyParallelism
Vocabulary Parallelism
☆24 Updated 8 months ago
DachengLi1 / AMP
(NeurIPS 2022) Automatically finding good model-parallel strategies, especially for complex models and clusters.
☆43 Updated 3 years ago
InternLM / turbomind
☆97 Updated 7 months ago
DerrickYLJ / TidalDecode
[ICLR 2025] TidalDecode: A Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
☆48 Updated 3 months ago