Explorations into some recent techniques surrounding speculative decoding
☆300 · Dec 22, 2024 · Updated last year
Alternatives and similar repositories for speculative-decoding
Users that are interested in speculative-decoding are comparing it to the libraries listed below.
- Fast inference from large language models via speculative decoding · ☆914 · Aug 22, 2024 · Updated last year
- 📰 Must-read papers and blogs on Speculative Decoding ⚡️ · ☆1,206 · Apr 18, 2026 · Updated 2 weeks ago
- Multi-Candidate Speculative Decoding · ☆40 · Apr 22, 2024 · Updated 2 years ago
- Implementation of the paper Fast Inference from Transformers via Speculative Decoding, Leviathan et al. 2023. · ☆106 · Dec 2, 2024 · Updated last year
- [NeurIPS'23] Speculative Decoding with Big Little Decoder · ☆97 · Feb 6, 2024 · Updated 2 years ago
- Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding** · ☆226 · Feb 13, 2025 · Updated last year
- REST: Retrieval-Based Speculative Decoding, NAACL 2024 · ☆218 · Mar 5, 2026 · Updated 2 months ago
- Simple implementation of Speculative Sampling in NumPy for GPT-2. · ☆99 · Aug 20, 2023 · Updated 2 years ago
- Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings) · ☆391 · Apr 22, 2025 · Updated last year
- Cascade Speculative Drafting · ☆33 · Apr 2, 2024 · Updated 2 years ago
- Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding (EMNLP 2023 Long) · ☆65 · Sep 28, 2024 · Updated last year
- [ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding · ☆1,333 · Mar 6, 2025 · Updated last year
- Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3 (NeurIPS'25). · ☆2,313 · Feb 20, 2026 · Updated 2 months ago
- Implementation of Speculative Sampling as described in "Accelerating Large Language Model Decoding with Speculative Sampling" by DeepMind · ☆110 · Feb 29, 2024 · Updated 2 years ago
- ☆16 · Aug 19, 2024 · Updated last year
- Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads · ☆2,730 · Jun 25, 2024 · Updated last year
- CUDA implementation of autoregressive linear attention, with all the latest research findings · ☆46 · May 23, 2023 · Updated 2 years ago
- [ACL 2026 (Main)] LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification · ☆82 · Jul 14, 2025 · Updated 9 months ago
- Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024 · ☆368 · Apr 13, 2026 · Updated 3 weeks ago
- ☆606 · Aug 23, 2024 · Updated last year
- Yet another random morning idea to be quickly tried and architecture shared if it works; to allow the transformer to pause for any amount… · ☆53 · Oct 22, 2023 · Updated 2 years ago
- ☆29 · May 24, 2025 · Updated 11 months ago
- ☆355 · Apr 2, 2024 · Updated 2 years ago
- Automatically Discovering Fast Parallelization Strategies for Distributed Deep Neural Network Training · ☆1,875 · Apr 29, 2026 · Updated last week
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity · ☆243 · Sep 24, 2023 · Updated 2 years ago
- [COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding · ☆279 · Aug 31, 2024 · Updated last year
- [ICLR 2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding · ☆146 · Dec 4, 2024 · Updated last year
- ☆26 · Mar 14, 2024 · Updated 2 years ago
- This repository contains papers for a comprehensive survey on accelerated generation techniques in Large Language Models (LLMs). · ☆11 · May 24, 2024 · Updated last year
- ☆28 · Feb 27, 2025 · Updated last year
- [NeurIPS 2024] The official implementation of "Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exitin… · ☆68 · Jun 26, 2024 · Updated last year
- Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in PyTorch · ☆548 · May 16, 2025 · Updated 11 months ago
- Implementation of GateLoop Transformer in PyTorch and JAX · ☆92 · Jun 18, 2024 · Updated last year
- Crawl & visualize ICLR papers and reviews. · ☆18 · Nov 5, 2022 · Updated 3 years ago
- FastCache: Fast Caching for Diffusion Transformer Through Learnable Linear Approximation [Efficient ML Model] · ☆49 · Apr 29, 2026 · Updated last week
- Serving multiple LoRA finetuned LLMs as one · ☆1,156 · May 8, 2024 · Updated last year
- Fine-Tuning Pre-trained Transformers into Decaying Fast Weights · ☆19 · Oct 9, 2022 · Updated 3 years ago
- DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting · ☆18 · Mar 4, 2025 · Updated last year
- The official repo of continuous speculative decoding · ☆33 · Mar 28, 2025 · Updated last year
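Several of the repositories above (the Leviathan et al. implementation, the DeepMind speculative sampling reimplementation, the NumPy GPT-2 example) implement the same draft-then-verify rejection scheme: a cheap draft model proposes a few tokens, the large target model scores them, and each draft token is accepted with probability min(1, p/q), with a residual resample on rejection. As a rough orientation (this is not code from any listed repo), here is a minimal NumPy sketch where toy seeded-softmax distributions stand in for the two models; `draft_probs`, `target_probs`, and `speculative_step` are hypothetical names chosen for illustration.

```python
import numpy as np

VOCAB = 8  # toy vocabulary size

def _probs(context, salt):
    # Deterministic toy "model": a softmax over random logits seeded by
    # the context, so repeated calls with the same context agree.
    seed = (abs(hash(tuple(context))) + salt) % (2**32)
    logits = np.random.default_rng(seed).standard_normal(VOCAB)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def draft_probs(context):   # stand-in for the small draft model
    return _probs(context, salt=1)

def target_probs(context):  # stand-in for the large target model
    return _probs(context, salt=2)

def speculative_step(context, k=4, rng=None):
    """One round of speculative sampling: draft k tokens, then verify."""
    if rng is None:
        rng = np.random.default_rng(0)

    # 1) Draft model autoregressively proposes k tokens.
    ctx, proposed = list(context), []
    for _ in range(k):
        t = int(rng.choice(VOCAB, p=draft_probs(ctx)))
        proposed.append(t)
        ctx.append(t)

    # 2) Target model verifies each draft token in order.
    #    (A real system scores all k positions in one batched forward pass.)
    out, ctx = [], list(context)
    for t in proposed:
        q, p = draft_probs(ctx), target_probs(ctx)
        if rng.random() < min(1.0, p[t] / q[t]):
            out.append(t)          # accept the draft token
            ctx.append(t)
        else:
            # Reject: resample from the residual max(0, p - q) distribution,
            # which keeps the overall output distribution exactly p.
            residual = np.maximum(p - q, 0.0)
            out.append(int(rng.choice(VOCAB, p=residual / residual.sum())))
            break
    else:
        # All k drafts accepted: the target grants one free bonus token.
        out.append(int(rng.choice(VOCAB, p=target_probs(ctx))))
    return out
```

Each call returns between 1 and k+1 tokens, which is where the speedup comes from: the expensive target model runs once per round rather than once per token, yet the accepted sequence is distributed exactly as if the target had sampled alone.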