Explorations into some recent techniques surrounding speculative decoding
⭐305 · Dec 22, 2024 · Updated last year
Alternatives and similar repositories for speculative-decoding
Users who are interested in speculative-decoding are comparing it to the libraries listed below.
- Fast inference from large language models via speculative decoding · ⭐917 · Aug 22, 2024 · Updated last year
- 📰 Must-read papers and blogs on Speculative Decoding ⚡️ · ⭐1,253 · Jun 2, 2026 · Updated last week
- Multi-Candidate Speculative Decoding · ⭐41 · Apr 22, 2024 · Updated 2 years ago
- Implementation of the paper Fast Inference from Transformers via Speculative Decoding, Leviathan et al. 2023. · ⭐110 · Dec 2, 2024 · Updated last year
- [NeurIPS'23] Speculative Decoding with Big Little Decoder · ⭐98 · Feb 6, 2024 · Updated 2 years ago
- Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding** · ⭐228 · Feb 13, 2025 · Updated last year
- REST: Retrieval-Based Speculative Decoding, NAACL 2024 · ⭐219 · Mar 5, 2026 · Updated 3 months ago
- Simple implementation of Speculative Sampling in NumPy for GPT-2. · ⭐99 · Aug 20, 2023 · Updated 2 years ago
- Cascade Speculative Drafting · ⭐33 · Apr 2, 2024 · Updated 2 years ago
- Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings) · ⭐397 · Apr 22, 2025 · Updated last year
- Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding (EMNLP 2023 Long) · ⭐66 · Sep 28, 2024 · Updated last year
- [ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding · ⭐1,337 · Mar 6, 2025 · Updated last year
- Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3 (NeurIPS'25). · ⭐2,397 · Feb 20, 2026 · Updated 3 months ago
- Implementation of Speculative Sampling as described in "Accelerating Large Language Model Decoding with Speculative Sampling" by DeepMind · ⭐111 · Feb 29, 2024 · Updated 2 years ago
- ⭐16 · Aug 19, 2024 · Updated last year
- Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads · ⭐2,751 · Jun 25, 2024 · Updated last year
- CUDA implementation of autoregressive linear attention, with all the latest research findings · ⭐46 · May 23, 2023 · Updated 3 years ago
- [ACL 2026 (Main)] LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification · ⭐83 · Jul 14, 2025 · Updated 11 months ago
- Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024 · ⭐372 · Apr 13, 2026 · Updated 2 months ago
- ⭐610 · Aug 23, 2024 · Updated last year
- Yet another random morning idea to be quickly tried and architecture shared if it works; to allow the transformer to pause for any amount… · ⭐53 · Oct 22, 2023 · Updated 2 years ago
- ⭐30 · May 24, 2025 · Updated last year
- ⭐359 · Apr 2, 2024 · Updated 2 years ago
- Automatically Discovering Fast Parallelization Strategies for Distributed Deep Neural Network Training · ⭐1,887 · Updated this week
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity · ⭐246 · Sep 24, 2023 · Updated 2 years ago
- [COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding · ⭐281 · Aug 31, 2024 · Updated last year
- [ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding · ⭐151 · Dec 4, 2024 · Updated last year
- ⭐28 · Mar 14, 2024 · Updated 2 years ago
- This repository contains papers for a comprehensive survey on accelerated generation techniques in Large Language Models (LLMs). · ⭐11 · May 24, 2024 · Updated 2 years ago
- ⭐28 · Feb 27, 2025 · Updated last year
- [NeurIPS 2024] The official implementation of "Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exitin… · ⭐69 · Jun 26, 2024 · Updated last year
- Implementation of Ring Attention, from Liu et al. at Berkeley AI, in PyTorch · ⭐547 · May 16, 2025 · Updated last year
- Implementation of GateLoop Transformer in PyTorch and Jax · ⭐92 · Jun 18, 2024 · Updated last year
- Crawl & visualize ICLR papers and reviews. · ⭐18 · Nov 5, 2022 · Updated 3 years ago
- FastCache: Fast Caching for Diffusion Transformer Through Learnable Linear Approximation [Efficient ML Model] · ⭐51 · Apr 29, 2026 · Updated last month
- Serving multiple LoRA-finetuned LLMs as one · ⭐1,159 · May 8, 2024 · Updated 2 years ago
- Fine-Tuning Pre-trained Transformers into Decaying Fast Weights · ⭐19 · Oct 9, 2022 · Updated 3 years ago
- DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting · ⭐18 · Mar 4, 2025 · Updated last year
- The official repo of continuous speculative decoding · ⭐35 · Mar 28, 2025 · Updated last year
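Several of the repositories listed above implement speculative sampling, including the Leviathan et al. 2023 implementation and the DeepMind "Accelerating Large Language Model Decoding with Speculative Sampling" implementation. These repositories share a common accept/reject verification step. The block below is a minimal, stdlib-only Python sketch of that step. It is not code from any listed repository. It assumes that the draft and target models have already produced normalized per-position probability lists. The helper names `sample` and `verify` are hypothetical.

```python
import random
from typing import List, Sequence


def sample(weights: Sequence[float], rng: random.Random) -> int:
    """Draw an index in proportion to non-negative (possibly unnormalized) weights."""
    total = sum(weights)
    r = rng.random() * total
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r < acc:
            return i
    # Floating-point round-off guard: fall back to the last index with positive weight.
    return max(i for i, w in enumerate(weights) if w > 0)


def verify(draft_tokens: List[int],
           q: List[List[float]],
           p: List[List[float]],
           rng: random.Random) -> List[int]:
    """Hypothetical verification step of speculative sampling (sketch only).

    draft_tokens: K tokens proposed by the draft model.
    q[i]: draft distribution that draft_tokens[i] was sampled from.
    p[i]: target distribution at position i; len(p) == K + 1, where p[K]
          is the target's distribution after all K drafts.
    Returns the accepted prefix plus exactly one token drawn from the target side.
    """
    out: List[int] = []
    for i, x in enumerate(draft_tokens):
        # Accept draft token x with probability min(1, p(x) / q(x)).
        if rng.random() < min(1.0, p[i][x] / q[i][x]):
            out.append(x)
            continue
        # On rejection, resample from the normalized residual max(0, p - q), then stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p[i], q[i])]
        out.append(sample(residual, rng))
        return out
    # Every draft was accepted: take one bonus token from the target's next distribution.
    out.append(sample(p[len(draft_tokens)], rng))
    return out
```

For example, suppose the target agrees exactly with the draft at every drafted position, so that `p[i] == q[i]` for each `i < K`. In that case every draft token is accepted, and the call returns the K drafts followed by one bonus token from `p[K]`. The residual resampling is what makes the output distribution match the target model's distribution. In this sketch it is only ever reached when `p(x) < q(x)`, which guarantees that the residual has positive mass.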