📰 Must-read papers and blogs on Speculative Decoding ⚡️
☆1,222 · May 11, 2026 · Updated last week
Alternatives and similar repositories for SpeculativeDecodingPapers
Users that are interested in SpeculativeDecodingPapers are comparing it to the libraries listed below.
- Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings) ☆392 · Apr 22, 2025 · Updated last year
- Fast inference from large language models via speculative decoding ☆914 · Aug 22, 2024 · Updated last year
- Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3 (NeurIPS'25). ☆2,343 · Feb 20, 2026 · Updated 3 months ago
- [ICLR 2025] SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration ☆67 · Feb 21, 2025 · Updated last year
- Explorations into some recent techniques surrounding speculative decoding ☆302 · Dec 22, 2024 · Updated last year
- Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding** ☆227 · Feb 13, 2025 · Updated last year
- [ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length ☆162 · Dec 23, 2025 · Updated 4 months ago
- Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads ☆2,741 · Jun 25, 2024 · Updated last year
- REST: Retrieval-Based Speculative Decoding, NAACL 2024 ☆220 · Mar 5, 2026 · Updated 2 months ago
- [COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding ☆278 · Aug 31, 2024 · Updated last year
- [ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding ☆1,337 · Mar 6, 2025 · Updated last year
- Code for our paper "Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation" (EMNLP 2023 Findings) ☆47 · Dec 9, 2023 · Updated 2 years ago
- Multi-Candidate Speculative Decoding ☆40 · Apr 22, 2024 · Updated 2 years ago
- Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main) ☆116 · Mar 20, 2025 · Updated last year
- A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc. ☆5,229 · Apr 20, 2026 · Updated last month
- A curated list for Efficient Large Language Models ☆2,008 · Jun 17, 2025 · Updated 11 months ago
- 📰 Must-read papers on KV Cache Compression (constantly updating). ☆705 · Apr 15, 2026 · Updated last month
- Official Implementation of "Learning Harmonized Representations for Speculative Sampling" (HASS) ☆57 · Mar 14, 2025 · Updated last year
- Scalable and robust tree-based speculative decoding algorithm ☆377 · Jan 28, 2025 · Updated last year
- Awesome LLM compression research papers and tools. ☆1,833 · Feb 23, 2026 · Updated 2 months ago
- [NeurIPS 2024] The official implementation of "Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exitin… ☆68 · Jun 26, 2024 · Updated last year
- [ACL 2026 (Main)] LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification ☆82 · Jul 14, 2025 · Updated 10 months ago
- Paper list for Efficient Reasoning. ☆886 · May 11, 2026 · Updated last week
- ☆30 · May 24, 2025 · Updated 11 months ago
- ☆66 · Dec 3, 2024 · Updated last year
- Automatically Discovering Fast Parallelization Strategies for Distributed Deep Neural Network Training ☆1,878 · Updated this week
- FlashInfer: Kernel Library for LLM Serving ☆5,621 · Updated this week
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads ☆541 · Feb 10, 2025 · Updated last year
- A throughput-oriented high-performance serving framework for LLMs ☆959 · Mar 29, 2026 · Updated last month
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ☆838 · Mar 6, 2025 · Updated last year
- [ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache ☆399 · Nov 20, 2025 · Updated 6 months ago
- [ICLR 2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆148 · Dec 4, 2024 · Updated last year
- Official Implementation of SAM-Decoding: Speculative Decoding via Suffix Automaton ☆49 · May 12, 2026 · Updated last week
- Dynamic Memory Management for Serving LLMs without PagedAttention ☆483 · May 30, 2025 · Updated 11 months ago
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference ☆384 · Jul 10, 2025 · Updated 10 months ago
- ☆605 · Aug 23, 2024 · Updated last year
- My learning notes for ML SYS. ☆6,312 · Apr 23, 2026 · Updated 3 weeks ago
- Code for our paper "Enhancing Continual Relation Extraction via Classifier Decomposition" (Findings of ACL 2023) ☆10 · Nov 29, 2023 · Updated 2 years ago
- Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. ☆5,339 · Updated this week