Fast inference from large language models via speculative decoding
★899 · Aug 22, 2024 · Updated last year
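The repository implements the standard draft-then-verify loop: a small draft model proposes several tokens, and the large target model scores them all in a single forward pass, accepting or resampling each one so that the output distribution matches the target model exactly. As a quick orientation before the related-project list, here is a minimal sketch of one such step; the toy `draft_probs`/`target_probs` helpers, the `VOCAB` size, and the `gamma` draft length are illustrative assumptions, not this repository's actual API.

```python
# Minimal sketch of one speculative-decoding step, following the
# accept/reject scheme of Leviathan et al. 2023 and Chen et al. 2023.
# The toy `draft_probs` / `target_probs` callables are stand-ins for
# real draft/target language models (illustrative, not this repo's API).
import numpy as np

VOCAB = 8                          # toy vocabulary size (assumption)
rng = np.random.default_rng(0)

def _toy_dist(prefix, seed):
    # Deterministic toy softmax over the vocabulary, so repeated calls on
    # the same prefix agree (a real model would return next-token probs).
    g = np.random.default_rng(abs(hash((tuple(prefix), seed))) % (2**32))
    logits = g.standard_normal(VOCAB)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def draft_probs(prefix):           # small, fast model q(x | prefix)
    return _toy_dist(prefix, 1)

def target_probs(prefix):          # large, accurate model p(x | prefix)
    return _toy_dist(prefix, 2)

def speculative_step(prefix, gamma=4):
    """Draft `gamma` tokens cheaply, then verify them with the target model.

    Returns the accepted tokens plus one resampled or bonus token, so each
    step emits between 1 and gamma + 1 tokens while matching the target
    model's output distribution exactly.
    """
    # 1) Autoregressively sample gamma tokens from the draft model.
    drafted, q = [], []
    ctx = list(prefix)
    for _ in range(gamma):
        dist = draft_probs(ctx)
        tok = int(rng.choice(VOCAB, p=dist))
        drafted.append(tok)
        q.append(dist)
        ctx.append(tok)
    # 2) Target distributions for all gamma + 1 prefixes; in a real system
    #    this is a single batched forward pass, which is the speedup.
    p = [target_probs(list(prefix) + drafted[:i]) for i in range(gamma + 1)]
    # 3) Accept draft token x_i with probability min(1, p_i(x_i) / q_i(x_i)).
    out = []
    for i, tok in enumerate(drafted):
        if rng.random() < min(1.0, p[i][tok] / q[i][tok]):
            out.append(tok)
        else:
            # Rejected: resample from the normalized residual max(0, p - q),
            # which restores the exact target distribution at this position.
            resid = np.maximum(p[i] - q[i], 0.0)
            out.append(int(rng.choice(VOCAB, p=resid / resid.sum())))
            return out
    # All gamma drafts accepted: sample one extra "bonus" token for free.
    out.append(int(rng.choice(VOCAB, p=p[gamma])))
    return out

if __name__ == "__main__":
    print(speculative_step([0], gamma=4))  # emits 1 to 5 tokens
```

The rejection rule (accept with probability min(1, p/q), otherwise resample from the normalized residual max(0, p - q)) is what makes the method lossless; the speedup comes from scoring all gamma drafts with one target-model pass instead of gamma sequential ones.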
Alternatives and similar repositories for LLMSpeculativeSampling
Users interested in LLMSpeculativeSampling are comparing it to the libraries listed below.
- 📰 Must-read papers and blogs on Speculative Decoding ⚡️ · ★1,145 · Mar 9, 2026 · Updated last week
- Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads · ★2,719 · Jun 25, 2024 · Updated last year
- REST: Retrieval-Based Speculative Decoding, NAACL 2024 · ★215 · Mar 5, 2026 · Updated 2 weeks ago
- Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings) · ★372 · Apr 22, 2025 · Updated 11 months ago
- Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding** · ★219 · Feb 13, 2025 · Updated last year
- Multi-Candidate Speculative Decoding · ★40 · Apr 22, 2024 · Updated last year
- Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3 (NeurIPS'25). · ★2,229 · Feb 20, 2026 · Updated last month
- Implementation of Speculative Sampling as described in "Accelerating Large Language Model Decoding with Speculative Sampling" by DeepMind · ★110 · Feb 29, 2024 · Updated 2 years ago
- [ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding · ★1,322 · Mar 6, 2025 · Updated last year
- Automatically Discovering Fast Parallelization Strategies for Distributed Deep Neural Network Training · ★1,864 · Updated this week
- [COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding · ★277 · Aug 31, 2024 · Updated last year
- [ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length · ★148 · Dec 23, 2025 · Updated 2 months ago
- [NeurIPS'23] Speculative Decoding with Big Little Decoder · ★96 · Feb 6, 2024 · Updated 2 years ago
- [ICLR 2025] SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration · ★65 · Feb 21, 2025 · Updated last year
- 📖 A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc. · ★5,062 · Updated this week
- FlashInfer: Kernel Library for LLM Serving · ★5,145 · Mar 15, 2026 · Updated last week
- ★28 · May 24, 2025 · Updated 9 months ago
- [NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. · ★506 · Aug 1, 2024 · Updated last year
- Simple implementation of Speculative Sampling in NumPy for GPT-2. · ★99 · Aug 20, 2023 · Updated 2 years ago
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… · ★818 · Mar 6, 2025 · Updated last year
- [MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration · ★3,463 · Jul 17, 2025 · Updated 8 months ago
- Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. · ★4,953 · Updated this week
- LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalabili… · ★3,958 · Updated this week
- ★26 · Mar 14, 2024 · Updated 2 years ago
- ★15 · Aug 19, 2024 · Updated last year
- ★64 · Dec 3, 2024 · Updated last year
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference · ★649 · Jan 15, 2026 · Updated 2 months ago
- ★306 · Jul 10, 2025 · Updated 8 months ago
- ★155 · Mar 4, 2025 · Updated last year
- Disaggregated serving system for Large Language Models (LLMs). · ★785 · Apr 6, 2025 · Updated 11 months ago
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads · ★531 · Feb 10, 2025 · Updated last year
- Transformer related optimization, including BERT, GPT · ★6,397 · Mar 27, 2024 · Updated last year
- Implementation of the paper Fast Inference from Transformers via Speculative Decoding, Leviathan et al. 2023. · ★101 · Dec 2, 2024 · Updated last year
- Scalable and robust tree-based speculative decoding algorithm · ★372 · Jan 28, 2025 · Updated last year
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving · ★336 · Jul 2, 2024 · Updated last year
- 📰 Must-read papers on KV Cache Compression (constantly updating 🤗). · ★674 · Feb 24, 2026 · Updated 3 weeks ago
- [ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache · ★359 · Nov 20, 2025 · Updated 4 months ago
- ★353 · Apr 2, 2024 · Updated last year
- Flash attention tutorial written in Python, Triton, CUDA, CUTLASS · ★491 · Jan 20, 2026 · Updated 2 months ago