📰 Must-read papers and blogs on Speculative Decoding ⚡️
⭐ 1,145 · Mar 9, 2026 · Updated last week
Alternatives and similar repositories for SpeculativeDecodingPapers
Users interested in SpeculativeDecodingPapers are comparing it to the repositories listed below.
- Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings) ⭐ 372 · Apr 22, 2025 · Updated 11 months ago
- Fast inference from large language models via speculative decoding ⭐ 899 · Aug 22, 2024 · Updated last year
- Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3 (NeurIPS'25). ⭐ 2,229 · Feb 20, 2026 · Updated last month
- [ICLR 2025] SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration ⭐ 65 · Feb 21, 2025 · Updated last year
- Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding** ⭐ 219 · Feb 13, 2025 · Updated last year
- [ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length ⭐ 148 · Dec 23, 2025 · Updated 2 months ago
- Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads ⭐ 2,719 · Jun 25, 2024 · Updated last year
- REST: Retrieval-Based Speculative Decoding, NAACL 2024 ⭐ 215 · Mar 5, 2026 · Updated 2 weeks ago
- [COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding ⭐ 277 · Aug 31, 2024 · Updated last year
- [ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding ⭐ 1,322 · Mar 6, 2025 · Updated last year
- Code for our paper "Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation" (EMNLP 2023 Findings) ⭐ 46 · Dec 9, 2023 · Updated 2 years ago
- Multi-Candidate Speculative Decoding ⭐ 40 · Apr 22, 2024 · Updated last year
- Official Implementation of "Learning Harmonized Representations for Speculative Sampling" (HASS) ⭐ 54 · Mar 14, 2025 · Updated last year
- A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc. ⭐ 5,062 · Updated this week
- Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main) ⭐ 117 · Mar 20, 2025 · Updated last year
- 📰 Must-read papers on KV Cache Compression (constantly updating 🤗). ⭐ 674 · Feb 24, 2026 · Updated 3 weeks ago
- A curated list for Efficient Large Language Models ⭐ 1,967 · Jun 17, 2025 · Updated 9 months ago
- Scalable and robust tree-based speculative decoding algorithm ⭐ 372 · Jan 28, 2025 · Updated last year
- Awesome LLM compression research papers and tools. ⭐ 1,789 · Feb 23, 2026 · Updated 3 weeks ago
- [NeurIPS 2024] The official implementation of "Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exitin… ⭐ 68 · Jun 26, 2024 · Updated last year
- LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification ⭐ 76 · Jul 14, 2025 · Updated 8 months ago
- Paper list for Efficient Reasoning. ⭐ 856 · Updated this week
- Automatically Discovering Fast Parallelization Strategies for Distributed Deep Neural Network Training ⭐ 1,864 · Updated this week
- FlashInfer: Kernel Library for LLM Serving ⭐ 5,145 · Mar 15, 2026 · Updated last week
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads ⭐ 531 · Feb 10, 2025 · Updated last year
- A throughput-oriented high-performance serving framework for LLMs ⭐ 949 · Oct 29, 2025 · Updated 4 months ago
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ⭐ 818 · Mar 6, 2025 · Updated last year
- Official Implementation of SAM-Decoding: Speculative Decoding via Suffix Automaton ⭐ 44 · Feb 13, 2025 · Updated last year
- [ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache ⭐ 359 · Nov 20, 2025 · Updated 4 months ago
- [ICLR 2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ⭐ 144 · Dec 4, 2024 · Updated last year
- Dynamic Memory Management for Serving LLMs without PagedAttention ⭐ 466 · May 30, 2025 · Updated 9 months ago
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference ⭐ 377 · Jul 10, 2025 · Updated 8 months ago
- My learning notes for ML SYS ⭐ 5,737 · Updated this week
- Implementation of Speculative Sampling as described in "Accelerating Large Language Model Decoding with Speculative Sampling" by DeepMind ⭐ 110 · Feb 29, 2024 · Updated 2 years ago
- Code for our paper "Enhancing Continual Relation Extraction via Classifier Decomposition" (Findings of ACL 2023) ⭐ 10 · Nov 29, 2023 · Updated 2 years ago
- Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI ⭐ 4,953 · Updated this week
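Most of the repositories above build on the same draft-and-verify loop from the original speculative sampling papers: a cheap draft model proposes a token, the target model accepts it with probability min(1, p/q), and on rejection a corrected token is resampled from the residual distribution so the output still follows the target model exactly. A minimal single-token sketch, using hand-picked toy distributions `p` and `q` (hypothetical values, not taken from any repository listed here):

```python
import random

def speculative_step(p, q, rng):
    """One token of speculative (rejection) sampling.

    p: target-model probability distribution over the vocabulary
    q: draft-model probability distribution over the same vocabulary
    Returns a token index whose distribution is exactly p.
    """
    # 1. Draft: sample a candidate token from the cheap draft distribution q.
    x = rng.choices(range(len(q)), weights=q)[0]
    # 2. Verify: accept the draft token with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    # 3. Reject: resample from the residual distribution norm(max(p - q, 0)),
    #    which exactly cancels the bias introduced by drafting from q.
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    total = sum(residual)
    return rng.choices(range(len(p)), weights=[r / total for r in residual])[0]

if __name__ == "__main__":
    rng = random.Random(0)
    p = [0.5, 0.3, 0.2]   # toy target distribution (hypothetical)
    q = [0.2, 0.4, 0.4]   # toy draft distribution (hypothetical)
    n = 100_000
    counts = [0, 0, 0]
    for _ in range(n):
        counts[speculative_step(p, q, rng)] += 1
    # Empirical frequencies converge to p, not q, despite drafting from q.
    print([round(c / n, 2) for c in counts])
```

In the real systems listed above the draft model proposes several tokens at once and the target model verifies them in a single forward pass; the acceptance rule per token is the same as in this sketch.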