AniZpZ / vllmLinks

A high-throughput and memory-efficient inference and serving engine for LLMs

☆8

Alternatives and similar repositories for vllm

Users that are interested in vllm are comparing it to the libraries listed below

Sorting:

AniZpZ / AutoSmoothQuant
An easy-to-use package for implementing SmoothQuant for LLMs
☆102Updated 2 months ago
HandH1998 / QQQ
QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs.
☆128Updated 2 months ago
ByteDance-Seed / decoupleQ
A quantization algorithm for LLM
☆141Updated last year
FasterDecoding / TEAL
☆130Updated 4 months ago
linfeng93 / BiTA
An innovative method expediting LLMs via streamlined semi-autoregressive generation and draft verification.
☆26Updated 2 months ago
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆79Updated 9 months ago
ByteDance-Seed / FlexPrefill
Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
☆113Updated last month
feifeibear / Odysseus-Transformer
Odysseus: Playground of LLM Sequence Parallelism
☆70Updated last year
InternLM / turbomind
☆87Updated 3 months ago
Adlik / smoothquantplus
[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
☆23Updated last year
Infini-AI-Lab / MagicDec
[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
☆116Updated 6 months ago
zankner / Hydra
☆45Updated last year
ChenMnZ / PrefixQuant
An algorithm for weight-activation quantization (W4A4, W4A8) of LLMs, supporting both static and dynamic quantization
☆137Updated last month
neuralmagic / AutoFP8
☆194Updated last month
ClubieDong / QAQ-KVCacheQuantization
QAQ: Quality Adaptive Quantization for LLM KV Cache
☆51Updated last year
thunlp / Ouroboros
Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main)
☆106Updated 3 months ago
d-matrix-ai / keyformer-llm
☆54Updated last year
opengear-project / GEAR
GEAR: An Efficient KV Cache Compression Recipefor Near-Lossless Generative Inference of LLM
☆163Updated 11 months ago
Equationliu / Kangaroo
[NeurIPS 2024] The official implementation of "Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exitin…
☆57Updated last year
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆100Updated 3 weeks ago
INT-FlashAttention2024 / INT-FlashAttention
☆75Updated 5 months ago
chu-tianxiang / QuIP-for-all
QuIP quantization
☆54Updated last year
NickL77 / BaldEagle
3x Faster Inference; Unofficial implementation of EAGLE Speculative Decoding
☆66Updated last week
wejoncy / QLLM
A general 2-8 bits quantization toolbox with GPTQ/AWQ/HQQ/VPTQ, and export to onnx/onnx-runtime easily.
☆172Updated 2 months ago
microsoft / chunk-attention
☆77Updated 2 months ago
Intelligent-Computing-Lab-Yale / GPTAQ
Code implementation of GPTAQ (https://arxiv.org/abs/2504.02692)
☆47Updated 3 weeks ago
sail-sg / VocabularyParallelism
Vocabulary Parallelism
☆19Updated 3 months ago
shadowpa0327 / Palu
[ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection
☆123Updated 4 months ago
efeslab / Atom
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
☆310Updated 11 months ago
DD-DuDa / BitDistiller
[ACL 2024] A novel QAT with Self-Distillation framework to enhance ultra low-bit LLMs.
☆115Updated last year