Infini-AI-Lab / MagicPIGLinks

[ICLR2025 Spotlight] MagicPIG: LSH Sampling for Efficient LLM Generation

☆240

Alternatives and similar repositories for MagicPIG

Users that are interested in MagicPIG are comparing it to the libraries listed below

Sorting:

FasterDecoding / SnapKV
☆290Updated 4 months ago
mit-han-lab / Quest
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
☆353Updated 4 months ago
Infini-AI-Lab / MagicDec
[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
☆131Updated 11 months ago
opengear-project / GEAR
GEAR: An Efficient KV Cache Compression Recipefor Near-Lossless Generative Inference of LLM
☆170Updated last year
microsoft / SeerAttention
SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs
☆171Updated last month
IsaacRe / vllm-kvcompress
KV cache compression for high-throughput LLM inference
☆143Updated 9 months ago
andy-yang-1 / DoubleSparse
16-fold memory access reduction with nearly no loss
☆107Updated 7 months ago
efeslab / fiddler
[ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration
☆240Updated last year
hao-ai-lab / Dynasor
[NeurIPS 2025] Simple extension on vLLM to help you speed up reasoning model without training.
☆206Updated 5 months ago
ByteDance-Seed / FlexPrefill
Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
☆154Updated last month
jy-yuan / KIVI
[ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
☆332Updated last month
henryzhongsc / longctx_bench
Official implementation for Yuan & Liu & Zhong et al., KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark o…
☆86Updated 8 months ago
FasterDecoding / TEAL
☆148Updated 9 months ago
shadowpa0327 / Palu
[ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection
☆146Updated 9 months ago
thunlp / Ouroboros
Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main)
☆112Updated 8 months ago
FasterDecoding / REST
REST: Retrieval-Based Speculative Decoding, NAACL 2024
☆210Updated 2 months ago
mit-han-lab / duo-attention
[ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
☆501Updated 9 months ago
smart-lty / ParallelSpeculativeDecoding
[ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length
☆130Updated 3 weeks ago
microsoft / ParrotServe
[OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable
☆191Updated last year
thu-nics / MoA
[CoLM'25] The official implementation of the paper <MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression>
☆150Updated 4 months ago
dilab-zju / self-speculative-decoding
Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding**
☆208Updated 9 months ago
efeslab / Atom
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
☆328Updated last year
SqueezeAILab / KVQuant
[NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
☆390Updated last year
LiuXiaoxuanPKU / OSD
☆60Updated 11 months ago
EfficientMoE / MoE-Infinity
PyTorch library for cost-effective, fast and easy serving of MoE models.
☆259Updated last month
DerrickYLJ / TidalDecode
[ICLR 2025] TidalDecode: A Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
☆48Updated 3 months ago
FFY0 / AdaKV
The Official Implementation of Ada-KV [NeurIPS 2025]
☆113Updated last month
FMInference / H2O
[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.
☆487Updated last year
hyx1999 / SAM-Decoding
Official Implementation of SAM-Decoding: Speculative Decoding via Suffix Automaton
☆36Updated 9 months ago
Equationliu / Kangaroo
[NeurIPS 2024] The official implementation of "Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exitin…
☆63Updated last year