Explorations into some recent techniques surrounding speculative decoding
☆300 · Dec 22, 2024 · Updated last year
Alternatives and similar repositories for speculative-decoding
Users interested in speculative-decoding are comparing it to the libraries listed below.
- Fast inference from large language models via speculative decoding ☆910 · Aug 22, 2024 · Updated last year
- 📰 Must-read papers and blogs on Speculative Decoding ⚡️ ☆1,185 · Mar 31, 2026 · Updated 2 weeks ago
- Multi-Candidate Speculative Decoding ☆40 · Apr 22, 2024 · Updated last year
- Implementation of the paper Fast Inference from Transformers via Speculative Decoding, Leviathan et al. 2023. ☆106 · Dec 2, 2024 · Updated last year
- [NeurIPS'23] Speculative Decoding with Big Little Decoder ☆97 · Feb 6, 2024 · Updated 2 years ago
- Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding** ☆224 · Feb 13, 2025 · Updated last year
- REST: Retrieval-Based Speculative Decoding, NAACL 2024 ☆217 · Mar 5, 2026 · Updated last month
- Simple implementation of Speculative Sampling in NumPy for GPT-2. ☆99 · Aug 20, 2023 · Updated 2 years ago
- Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings) ☆383 · Apr 22, 2025 · Updated 11 months ago
- Cascade Speculative Drafting ☆33 · Apr 2, 2024 · Updated 2 years ago
- Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding (EMNLP 2023 Long) ☆65 · Sep 28, 2024 · Updated last year
- [ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding ☆1,327 · Mar 6, 2025 · Updated last year
- Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3 (NeurIPS'25). ☆2,273 · Feb 20, 2026 · Updated last month
- Implementation of Speculative Sampling as described in "Accelerating Large Language Model Decoding with Speculative Sampling" by DeepMind ☆110 · Feb 29, 2024 · Updated 2 years ago
- ☆15 · Aug 19, 2024 · Updated last year
- Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads ☆2,722 · Jun 25, 2024 · Updated last year
- CUDA implementation of autoregressive linear attention, with all the latest research findings ☆46 · May 23, 2023 · Updated 2 years ago
- [ACL 2026 (Main)] LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification ☆79 · Jul 14, 2025 · Updated 9 months ago
- Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024 ☆364 · Apr 8, 2026 · Updated last week
- ☆602 · Aug 23, 2024 · Updated last year
- Yet another random morning idea to be quickly tried and architecture shared if it works; to allow the transformer to pause for any amount… ☆53 · Oct 22, 2023 · Updated 2 years ago
- ☆28 · May 24, 2025 · Updated 10 months ago
- ☆355 · Apr 2, 2024 · Updated 2 years ago
- Automatically Discovering Fast Parallelization Strategies for Distributed Deep Neural Network Training ☆1,872 · Updated this week
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity ☆240 · Sep 24, 2023 · Updated 2 years ago
- [COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding ☆279 · Aug 31, 2024 · Updated last year
- [ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆145 · Dec 4, 2024 · Updated last year
- ☆26 · Mar 14, 2024 · Updated 2 years ago
- This repository contains papers for a comprehensive survey on accelerated generation techniques in Large Language Models (LLMs). ☆11 · May 24, 2024 · Updated last year
- ☆28 · Feb 27, 2025 · Updated last year
- [NeurIPS 2024] The official implementation of "Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exitin… ☆68 · Jun 26, 2024 · Updated last year
- Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in Pytorch ☆548 · May 16, 2025 · Updated 11 months ago
- Implementation of GateLoop Transformer in Pytorch and Jax ☆92 · Jun 18, 2024 · Updated last year
- Crawl & visualize ICLR papers and reviews. ☆18 · Nov 5, 2022 · Updated 3 years ago
- FastCache: Fast Caching for Diffusion Transformer Through Learnable Linear Approximation [Efficient ML Model] ☆48 · Feb 17, 2026 · Updated last month
- Serving multiple LoRA finetuned LLMs as one ☆1,152 · May 8, 2024 · Updated last year
- Fine-Tuning Pre-trained Transformers into Decaying Fast Weights ☆19 · Oct 9, 2022 · Updated 3 years ago
- DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting ☆17 · Mar 4, 2025 · Updated last year
- The official repo of continuous speculative decoding ☆32 · Mar 28, 2025 · Updated last year
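For readers comparing these projects, most of them build on the same draft-then-verify loop from Leviathan et al. 2023 (and DeepMind's speculative sampling paper, both listed above): a cheap draft model proposes several tokens, and the target model accepts each with probability min(1, p_target/p_draft), resampling from the residual distribution on rejection. Below is a minimal, self-contained toy sketch of that loop, not code from any repository here; the tiny vocabulary and the draft/target distributions are invented for illustration, and the usual "bonus" target-model token after a full acceptance is omitted for brevity.

```python
import random

random.seed(0)

VOCAB = ["the", "cat", "sat", "on", "mat"]

def draft_model(prefix):
    # Toy stand-in for a small, fast LM: uniform over the vocabulary.
    return {t: 1.0 / len(VOCAB) for t in VOCAB}

def target_model(prefix):
    # Toy stand-in for the large LM: strongly prefers "cat".
    probs = {t: 0.1 for t in VOCAB}
    probs["cat"] = 1.0 - 0.1 * (len(VOCAB) - 1)
    return probs

def sample(probs):
    # Sample one token from a {token: probability} dict.
    r, acc = random.random(), 0.0
    for tok, p in probs.items():
        acc += p
        if r <= acc:
            return tok
    return tok  # guard against floating-point undershoot

def speculative_step(prefix, k=4):
    """One draft-then-verify step: draft k tokens with the cheap model,
    then accept each with probability min(1, p_target / p_draft)."""
    # Phase 1: autoregressively draft k candidate tokens.
    drafted, ctx = [], list(prefix)
    for _ in range(k):
        q = draft_model(ctx)
        tok = sample(q)
        drafted.append((tok, q[tok]))
        ctx.append(tok)

    # Phase 2: verify against the target model, left to right.
    accepted, ctx = [], list(prefix)
    for tok, q_tok in drafted:
        p = target_model(ctx)
        if random.random() < min(1.0, p[tok] / q_tok):
            accepted.append(tok)
            ctx.append(tok)
        else:
            # On rejection, resample from the residual max(0, p - q),
            # normalized; this keeps the output distribution exactly
            # the target model's (the lossless guarantee).
            q = draft_model(ctx)
            residual = {t: max(0.0, p[t] - q[t]) for t in VOCAB}
            z = sum(residual.values()) or 1.0
            accepted.append(sample({t: v / z for t, v in residual.items()}))
            break
    return accepted

print(speculative_step(["the"]))
```

Each call yields between 1 and k tokens per target-model verification pass, which is where the speedup comes from; the listed projects differ mainly in how the draft tokens are produced (a separate small model, early exits, retrieval, extra decoding heads, etc.).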