📰 Must-read papers and blogs on Speculative Decoding ⚡️
⭐ 1,204 · Apr 18, 2026 · Updated last week
Alternatives and similar repositories for SpeculativeDecodingPapers
Users interested in SpeculativeDecodingPapers are comparing it to the libraries listed below.
- Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings) · ⭐ 389 · Apr 22, 2025 · Updated last year
- Fast inference from large language models via speculative decoding · ⭐ 914 · Aug 22, 2024 · Updated last year
- Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3 (NeurIPS'25) · ⭐ 2,299 · Feb 20, 2026 · Updated 2 months ago
- [ICLR 2025] SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration · ⭐ 66 · Feb 21, 2025 · Updated last year
- Explorations into some recent techniques surrounding speculative decoding · ⭐ 300 · Dec 22, 2024 · Updated last year
- Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding** · ⭐ 226 · Feb 13, 2025 · Updated last year
- [ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length · ⭐ 160 · Dec 23, 2025 · Updated 4 months ago
- Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads · ⭐ 2,727 · Jun 25, 2024 · Updated last year
- REST: Retrieval-Based Speculative Decoding, NAACL 2024 · ⭐ 218 · Mar 5, 2026 · Updated last month
- [COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding · ⭐ 279 · Aug 31, 2024 · Updated last year
- [ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding · ⭐ 1,333 · Mar 6, 2025 · Updated last year
- Code for our paper "Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation" (EMNLP 2023 Findings) · ⭐ 47 · Dec 9, 2023 · Updated 2 years ago
- Multi-Candidate Speculative Decoding · ⭐ 40 · Apr 22, 2024 · Updated 2 years ago
- Official Implementation of "Learning Harmonized Representations for Speculative Sampling" (HASS) · ⭐ 56 · Mar 14, 2025 · Updated last year
- Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main) · ⭐ 116 · Mar 20, 2025 · Updated last year
- 📚 A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc. · ⭐ 5,185 · Apr 20, 2026 · Updated last week
- A curated list for Efficient Large Language Models · ⭐ 1,993 · Jun 17, 2025 · Updated 10 months ago
- 📰 Must-read papers on KV Cache Compression (constantly updating 🤗) · ⭐ 694 · Apr 15, 2026 · Updated 2 weeks ago
- Scalable and robust tree-based speculative decoding algorithm · ⭐ 377 · Jan 28, 2025 · Updated last year
- Awesome LLM compression research papers and tools · ⭐ 1,824 · Feb 23, 2026 · Updated 2 months ago
- [NeurIPS 2024] The official implementation of "Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exiting" · ⭐ 68 · Jun 26, 2024 · Updated last year
- [ACL 2026 (Main)] LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification · ⭐ 82 · Jul 14, 2025 · Updated 9 months ago
- Paper list for Efficient Reasoning · ⭐ 879 · Apr 21, 2026 · Updated last week
- ⭐ 28 · May 24, 2025 · Updated 11 months ago
- ⭐ 66 · Dec 3, 2024 · Updated last year
- Automatically Discovering Fast Parallelization Strategies for Distributed Deep Neural Network Training · ⭐ 1,873 · Apr 23, 2026 · Updated last week
- FlashInfer: Kernel Library for LLM Serving · ⭐ 5,498 · Updated this week
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads · ⭐ 540 · Feb 10, 2025 · Updated last year
- A throughput-oriented high-performance serving framework for LLMs · ⭐ 954 · Mar 29, 2026 · Updated last month
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Serving · ⭐ 834 · Mar 6, 2025 · Updated last year
- Official Implementation of SAM-Decoding: Speculative Decoding via Suffix Automaton · ⭐ 47 · Feb 13, 2025 · Updated last year
- [ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache · ⭐ 387 · Nov 20, 2025 · Updated 5 months ago
- [ICLR 2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding · ⭐ 146 · Dec 4, 2024 · Updated last year
- Dynamic Memory Management for Serving LLMs without PagedAttention · ⭐ 480 · May 30, 2025 · Updated 11 months ago
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference · ⭐ 381 · Jul 10, 2025 · Updated 9 months ago
- My learning notes for ML SYS · ⭐ 6,110 · Apr 23, 2026 · Updated last week
- Implementation of Speculative Sampling as described in "Accelerating Large Language Model Decoding with Speculative Sampling" by DeepMind · ⭐ 110 · Feb 29, 2024 · Updated 2 years ago
- ⭐ 606 · Aug 23, 2024 · Updated last year
- Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI · ⭐ 5,186 · Apr 24, 2026 · Updated last week
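Most of the repositories above are variations on the draft-then-verify loop of speculative sampling (the technique described in the DeepMind paper whose implementation is listed above). As a rough orientation only, here is a minimal sketch with toy next-token distributions standing in for real models; the function and parameter names are illustrative and do not come from any listed repository:

```python
import random

def speculative_decode(target_probs, draft_probs, prefix, k, vocab, rng):
    """One round of speculative sampling.

    target_probs(seq) / draft_probs(seq) return a dict token -> probability
    for the next token; here they are toy callables, not real models.
    """
    # Draft phase: the cheap model proposes k tokens autoregressively
    # (greedy drafting, for simplicity).
    seq = list(prefix)
    drafts = []
    for _ in range(k):
        q = draft_probs(seq)
        tok = max(q, key=q.get)
        drafts.append((tok, q))
        seq.append(tok)

    # Verify phase: the target model scores each drafted position.
    # (A real system does this in a single parallel forward pass.)
    out = list(prefix)
    for tok, q in drafts:
        p = target_probs(out)
        # Accept with probability min(1, p(tok)/q(tok)); this rejection rule
        # keeps the output distribution identical to the target model's.
        if rng.random() < min(1.0, p.get(tok, 0.0) / q[tok]):
            out.append(tok)
        else:
            # On rejection, resample from the renormalized residual
            # max(0, p - q) and stop accepting further drafts.
            resid = {t: max(0.0, p.get(t, 0.0) - q.get(t, 0.0)) for t in vocab}
            z = sum(resid.values())
            if z == 0.0:  # draft equals target: fall back to target itself
                resid, z = dict(p), sum(p.values())
            r = rng.random() * z
            acc = 0.0
            for t in vocab:
                acc += resid.get(t, 0.0)
                if r <= acc:
                    out.append(t)
                    break
            return out
    return out
```

When the draft distribution matches the target, every proposed token is accepted, which is the source of the speedup: k tokens per expensive target pass instead of one.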