Fast inference from large language models via speculative decoding
★ 917 · Aug 22, 2024 · Updated last year
Alternatives and similar repositories for LLMSpeculativeSampling
Users that are interested in LLMSpeculativeSampling are comparing it to the libraries listed below.
- Explorations into some recent techniques surrounding speculative decoding · ★ 307 · Dec 22, 2024 · Updated last year
- 📰 Must-read papers and blogs on Speculative Decoding ⚡️ · ★ 1,259 · Jun 2, 2026 · Updated 3 weeks ago
- Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads · ★ 2,751 · Jun 25, 2024 · Updated 2 years ago
- REST: Retrieval-Based Speculative Decoding, NAACL 2024 · ★ 219 · Mar 5, 2026 · Updated 3 months ago
- Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings) · ★ 398 · Apr 22, 2025 · Updated last year
- Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding** · ★ 229 · Feb 13, 2025 · Updated last year
- Multi-Candidate Speculative Decoding · ★ 41 · Apr 22, 2024 · Updated 2 years ago
- Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3 (NeurIPS'25). · ★ 2,418 · Feb 20, 2026 · Updated 4 months ago
- Implementation of Speculative Sampling as described in "Accelerating Large Language Model Decoding with Speculative Sampling" by DeepMind · ★ 111 · Feb 29, 2024 · Updated 2 years ago
- [ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding · ★ 1,337 · Mar 6, 2025 · Updated last year
- Automatically Discovering Fast Parallelization Strategies for Distributed Deep Neural Network Training · ★ 1,890 · Updated this week
- [COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding · ★ 281 · Aug 31, 2024 · Updated last year
- [NeurIPS'23] Speculative Decoding with Big Little Decoder · ★ 99 · Feb 6, 2024 · Updated 2 years ago
- [ICLR 2025] SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration · ★ 70 · Feb 21, 2025 · Updated last year
- [ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length · ★ 165 · Dec 23, 2025 · Updated 6 months ago
- A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc. · ★ 5,355 · Jun 23, 2026 · Updated last week
- FlashInfer: Kernel Library for LLM Serving · ★ 5,867 · Updated this week
- ★ 30 · May 24, 2025 · Updated last year
- [NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. · ★ 523 · Aug 1, 2024 · Updated last year
- Simple implementation of Speculative Sampling in NumPy for GPT-2. · ★ 99 · Aug 20, 2023 · Updated 2 years ago
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… · ★ 845 · Mar 6, 2025 · Updated last year
- [MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration · ★ 3,578 · Jul 17, 2025 · Updated 11 months ago
- LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalabili… · ★ 4,141 · Updated this week
- Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. · ★ 5,672 · Updated this week
- ★ 28 · Mar 14, 2024 · Updated 2 years ago
- ★ 16 · Aug 19, 2024 · Updated last year
- ★ 68 · Dec 3, 2024 · Updated last year
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference · ★ 673 · May 21, 2026 · Updated last month
- ★ 321 · Jul 10, 2025 · Updated 11 months ago
- Transformer related optimization, including BERT, GPT · ★ 6,433 · Mar 27, 2024 · Updated 2 years ago
- Disaggregated serving system for Large Language Models (LLMs). · ★ 822 · Apr 6, 2025 · Updated last year
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads · ★ 539 · Feb 10, 2025 · Updated last year
- ★ 157 · Mar 4, 2025 · Updated last year
- Scalable and robust tree-based speculative decoding algorithm · ★ 376 · Jan 28, 2025 · Updated last year
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving · ★ 342 · Jul 2, 2024 · Updated last year
- 📰 Must-read papers on KV Cache Compression (constantly updating). · ★ 720 · Apr 15, 2026 · Updated 2 months ago
- Implementation of the paper Fast Inference from Transformers via Speculative Decoding, Leviathan et al. 2023. · ★ 111 · Dec 2, 2024 · Updated last year
- ★ 358 · Apr 2, 2024 · Updated 2 years ago
- [ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache · ★ 414 · Nov 20, 2025 · Updated 7 months ago