📰 Must-read papers and blogs on Speculative Decoding ⚡️
☆1,126 · Jan 24, 2026 · Updated last month
Alternatives and similar repositories for SpeculativeDecodingPapers
Users interested in SpeculativeDecodingPapers are comparing it to the libraries listed below.
- Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings) ☆369 · Apr 22, 2025 · Updated 10 months ago
- Fast inference from large language models via speculative decoding (see the sampling sketch after this list) ☆888 · Aug 22, 2024 · Updated last year
- Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3 (NeurIPS'25). ☆2,201 · Feb 20, 2026 · Updated last week
- [ICLR 2025] SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration ☆62 · Feb 21, 2025 · Updated last year
- Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding** ☆214 · Feb 13, 2025 · Updated last year
- [ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length ☆147 · Dec 23, 2025 · Updated 2 months ago
- REST: Retrieval-Based Speculative Decoding, NAACL 2024 ☆214 · Sep 11, 2025 · Updated 5 months ago
- [ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding ☆1,315 · Mar 6, 2025 · Updated 11 months ago
- Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads ☆2,708 · Jun 25, 2024 · Updated last year
- [COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding ☆277 · Aug 31, 2024 · Updated last year
- Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main) ☆114 · Mar 20, 2025 · Updated 11 months ago
- Multi-Candidate Speculative Decoding ☆39 · Apr 22, 2024 · Updated last year
- 📰 Must-read papers on KV Cache Compression (constantly updating 🤗). ☆661 · Updated this week
- A curated list for Efficient Large Language Models ☆1,954 · Jun 17, 2025 · Updated 8 months ago
- A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc. ☆5,022 · Updated this week
- Codes for our paper "Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation" (EMNLP 2023 Findings) ☆46 · Dec 9, 2023 · Updated 2 years ago
- Awesome LLM compression research papers and tools. ☆1,780 · Feb 23, 2026 · Updated last week
- Official Implementation of "Learning Harmonized Representations for Speculative Sampling" (HASS) ☆54 · Mar 14, 2025 · Updated 11 months ago
- [NeurIPS 2024] The official implementation of "Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exitin… ☆65 · Jun 26, 2024 · Updated last year
- scalable and robust tree-based speculative decoding algorithm ☆370 · Jan 28, 2025 · Updated last year
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ☆816 · Mar 6, 2025 · Updated 11 months ago
- FlashInfer: Kernel Library for LLM Serving ☆5,009 · Feb 23, 2026 · Updated last week
- A throughput-oriented high-performance serving framework for LLMs ☆946 · Oct 29, 2025 · Updated 4 months ago
- Automatically Discovering Fast Parallelization Strategies for Distributed Deep Neural Network Training ☆1,863 · Updated this week
- LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification ☆74 · Jul 14, 2025 · Updated 7 months ago
- [ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache (see the quantization sketch after this list) ☆358 · Nov 20, 2025 · Updated 3 months ago
- Paper list for Efficient Reasoning. ☆828 · Updated this week
- Dynamic Memory Management for Serving LLMs without PagedAttention ☆464 · May 30, 2025 · Updated 9 months ago
- [ICLR 2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆144 · Dec 4, 2024 · Updated last year
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference ☆374 · Jul 10, 2025 · Updated 7 months ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆177 · Jul 12, 2024 · Updated last year
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens. ☆1,025 · Sep 4, 2024 · Updated last year
- Official Implementation of SAM-Decoding: Speculative Decoding via Suffix Automaton ☆40 · Feb 13, 2025 · Updated last year
- My learning notes for ML SYS. ☆5,444 · Jan 30, 2026 · Updated last month
- LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalabili… ☆3,919 · Updated this week
- Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. ☆4,843 · Updated this week
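For readers new to the topic, here is a minimal sketch of the draft-then-verify acceptance rule that most of the repositories above build on (the speculative sampling scheme of Leviathan et al. and Chen et al.). The function name, the toy numpy distributions, and the demo setup are illustrative assumptions, not any listed repo's actual API:

```python
# Minimal sketch of speculative sampling's accept/reject step.
# Names and toy numpy distributions are illustrative, not a real repo's API.
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(p_target, q_draft, draft_tokens):
    """Verify draft_tokens against the target model in one pass.

    p_target[i], q_draft[i]: target/draft probabilities over the vocab
    before emitting draft_tokens[i]; p_target has one extra row used for
    the bonus token sampled when every draft token is accepted.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        # Accept the draft token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p_target[i][tok] / q_draft[i][tok]):
            out.append(tok)
        else:
            # Rejected: resample from the residual max(0, p - q), renormalized.
            residual = np.maximum(p_target[i] - q_draft[i], 0.0)
            residual /= residual.sum()
            out.append(int(rng.choice(len(residual), p=residual)))
            return out  # stop at the first rejection
    # All drafts accepted: sample one bonus token from the target model.
    out.append(int(rng.choice(len(p_target[-1]), p=p_target[-1])))
    return out

# Toy demo: 3 draft tokens over an 8-token vocabulary.
vocab = 8
q = rng.dirichlet(np.ones(vocab), size=3)   # draft distributions
p = rng.dirichlet(np.ones(vocab), size=4)   # target distributions (+1 bonus row)
drafts = [int(rng.choice(vocab, p=qi)) for qi in q]
print(speculative_step(p, q, drafts))
```

The key property of this rule is that the accepted-plus-resampled tokens are distributed exactly as if they were sampled from the target model alone, which is why the speedup is lossless.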
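Several entries above (KIVI, GEAR, QServe) compress the KV cache rather than the decoding loop. As a toy illustration of the asymmetric low-bit quantization idea behind them (KIVI, for instance, quantizes the key cache per-channel and the value cache per-token), here is a sketch under those assumptions; it is my own illustration, not KIVI's actual code:

```python
# Toy asymmetric 2-bit uniform quantization of a key-cache tensor.
# Illustrative only; real implementations use grouping, packing, and fused kernels.
import numpy as np

def quantize_2bit(x, axis):
    # Map [min, max] along `axis` to the 4 levels {0, 1, 2, 3}.
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    scale = (hi - lo) / 3.0                      # 2 bits -> 4 levels
    q = np.clip(np.round((x - lo) / scale), 0, 3).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    return q.astype(np.float32) * scale + lo

k = np.random.randn(16, 64).astype(np.float32)   # [tokens, channels]
q, s, z = quantize_2bit(k, axis=0)               # per-channel, as for keys
print(f"mean abs error: {np.abs(dequantize(q, s, z) - k).mean():.4f}")
```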