Must-read papers and blogs on Speculative Decoding
★1,259 · Jun 2, 2026 · Updated 3 weeks ago
Alternatives and similar repositories for SpeculativeDecodingPapers
Users that are interested in SpeculativeDecodingPapers are comparing it to the libraries listed below.
- Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings) ★398 · Apr 22, 2025 · Updated last year
- Fast inference from large language models via speculative decoding ★917 · Aug 22, 2024 · Updated last year
- Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3 (NeurIPS'25). ★2,418 · Feb 20, 2026 · Updated 4 months ago
- [ICLR 2025] SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration ★70 · Feb 21, 2025 · Updated last year
- Explorations into some recent techniques surrounding speculative decoding ★307 · Dec 22, 2024 · Updated last year
- Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding** ★229 · Feb 13, 2025 · Updated last year
- [ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length ★165 · Dec 23, 2025 · Updated 6 months ago
- Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads ★2,751 · Jun 25, 2024 · Updated 2 years ago
- REST: Retrieval-Based Speculative Decoding, NAACL 2024 ★219 · Mar 5, 2026 · Updated 3 months ago
- [COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding ★281 · Aug 31, 2024 · Updated last year
- [ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding ★1,337 · Mar 6, 2025 · Updated last year
- Codes for our paper "Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation" (EMNLP 2023 Findings) ★47 · Dec 9, 2023 · Updated 2 years ago
- Multi-Candidate Speculative Decoding ★41 · Apr 22, 2024 · Updated 2 years ago
- Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main) ★117 · Mar 20, 2025 · Updated last year
- A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc. ★5,355 · Jun 23, 2026 · Updated last week
- A curated list for Efficient Large Language Models ★2,019 · Jun 17, 2025 · Updated last year
- Must-read papers on KV Cache Compression (constantly updating). ★720 · Apr 15, 2026 · Updated 2 months ago
- Official Implementation of "Learning Harmonized Representations for Speculative Sampling" (HASS) ★56 · Mar 14, 2025 · Updated last year
- scalable and robust tree-based speculative decoding algorithm ★376 · Jan 28, 2025 · Updated last year
- Awesome LLM compression research papers and tools. ★1,848 · Feb 23, 2026 · Updated 4 months ago
- [NeurIPS 2024] The official implementation of "Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exitin… ★71 · Jun 26, 2024 · Updated 2 years ago
- [ACL 2026 (Main)] LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification ★83 · Jul 14, 2025 · Updated 11 months ago
- Paper list for Efficient Reasoning. ★893 · May 29, 2026 · Updated last month
- ★30 · May 24, 2025 · Updated last year
- ★68 · Dec 3, 2024 · Updated last year
- Automatically Discovering Fast Parallelization Strategies for Distributed Deep Neural Network Training ★1,890 · Updated this week
- FlashInfer: Kernel Library for LLM Serving ★5,867 · Updated this week
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads ★539 · Feb 10, 2025 · Updated last year
- A throughput-oriented high-performance serving framework for LLMs ★962 · Mar 29, 2026 · Updated 3 months ago
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ★845 · Mar 6, 2025 · Updated last year
- [ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache ★414 · Nov 20, 2025 · Updated 7 months ago
- [ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ★151 · Dec 4, 2024 · Updated last year
- Official Implementation of SAM-Decoding: Speculative Decoding via Suffix Automaton ★50 · May 12, 2026 · Updated last month
- Dynamic Memory Management for Serving LLMs without PagedAttention ★498 · Jun 10, 2026 · Updated 3 weeks ago
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference ★397 · Jul 10, 2025 · Updated 11 months ago
- My learning notes for ML SYS. ★6,590 · Updated this week
- Implementation of Speculative Sampling as described in "Accelerating Large Language Model Decoding with Speculative Sampling" by Deepmind ★111 · Feb 29, 2024 · Updated 2 years ago
- ★611 · Aug 23, 2024 · Updated last year
- Codes for our paper "Enhancing Continual Relation Extraction via Classifier Decomposition" (Findings of ACL2023) ★10 · Nov 29, 2023 · Updated 2 years ago