Fast inference from large language models via speculative decoding
⭐888 · Updated Aug 22, 2024
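For readers new to the technique, here is a minimal sketch of the speculative sampling loop this repository implements (following the Leviathan et al. and DeepMind papers): a small draft model proposes GAMMA tokens, the target model verifies them in one pass, and each draft token x is accepted with probability min(1, p(x)/q(x)), with a resample from the residual distribution on rejection. The toy distributions, helper names, and constants below are illustrative assumptions, not this repo's actual API.

```python
# A minimal sketch of speculative sampling, using toy next-token
# distributions in place of real draft/target models. The toy "models",
# function names, and GAMMA are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8    # toy vocabulary size
GAMMA = 4    # number of tokens the draft model proposes per step

def sample(dist):
    """Draw one token id from a probability vector."""
    return int(rng.choice(len(dist), p=dist))

def speculative_step(target_dist_fn, draft_dist_fn, prefix):
    """One speculative step: draft GAMMA tokens, then verify with the target."""
    # 1) The small draft model proposes GAMMA tokens autoregressively.
    ctx, drafted, q_dists = list(prefix), [], []
    for _ in range(GAMMA):
        q = draft_dist_fn(ctx)
        tok = sample(q)
        drafted.append(tok)
        q_dists.append(q)
        ctx.append(tok)
    # 2) The target model scores every drafted position; with a real model
    #    this is a single batched forward pass, not GAMMA separate calls.
    p_dists = [target_dist_fn(list(prefix) + drafted[:i]) for i in range(GAMMA + 1)]
    # 3) Accept each drafted token x with probability min(1, p(x)/q(x)).
    out = list(prefix)
    for i, tok in enumerate(drafted):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
        else:
            # Rejected: resample from the renormalized residual max(0, p - q),
            # which keeps the output distribution identical to the target's.
            residual = np.maximum(p - q, 0.0)
            out.append(sample(residual / residual.sum()))
            return out
    # All GAMMA tokens accepted: take one free "bonus" token from the target.
    out.append(sample(p_dists[GAMMA]))
    return out

def toy_model(seed):
    """Deterministic fake model: a fixed random distribution per context."""
    def dist(ctx):
        r = np.random.default_rng((hash(tuple(ctx)) % 2**32) + seed)
        d = r.random(VOCAB)
        return d / d.sum()
    return dist

print(speculative_step(toy_model(1), toy_model(2), prefix=[3, 1]))
```

The speedup comes from step 2: whenever the draft and target distributions agree often, each target-model pass yields more than one accepted token on average, while the accept/reject rule guarantees the outputs are distributed exactly as if sampled from the target alone.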
Alternatives and similar repositories for LLMSpeculativeSampling
Users interested in LLMSpeculativeSampling are comparing it to the libraries listed below.
- 📰 Must-read papers and blogs on Speculative Decoding ⚡️ ⭐1,126 · Updated Jan 24, 2026
- Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads ⭐2,708 · Updated Jun 25, 2024
- REST: Retrieval-Based Speculative Decoding (NAACL 2024) ⭐214 · Updated Sep 11, 2025
- Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings) ⭐369 · Updated Apr 22, 2025
- Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding** ⭐214 · Updated Feb 13, 2025
- Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3 (NeurIPS'25) ⭐2,201 · Updated Feb 20, 2026
- Multi-Candidate Speculative Decoding ⭐39 · Updated Apr 22, 2024
- Implementation of Speculative Sampling as described in "Accelerating Large Language Model Decoding with Speculative Sampling" by DeepMind ⭐108 · Updated Feb 29, 2024
- [ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding ⭐1,315 · Updated Mar 6, 2025
- Automatically Discovering Fast Parallelization Strategies for Distributed Deep Neural Network Training ⭐1,863 · Updated this week
- 📚 A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc. ⭐5,022 · Updated this week
- [NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models ⭐504 · Updated Aug 1, 2024
- FlashInfer: Kernel Library for LLM Serving ⭐5,009 · Updated Feb 23, 2026
- [COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding ⭐277 · Updated Aug 31, 2024
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ⭐816 · Updated Mar 6, 2025
- [ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length ⭐147 · Updated Dec 23, 2025
- Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI ⭐4,843 · Updated this week
- [MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration ⭐3,441 · Updated Jul 17, 2025
- LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalabili… ⭐3,919 · Updated this week
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference ⭐644 · Updated Jan 15, 2026
- ⭐301 · Updated Jul 10, 2025
- ⭐155 · Updated Mar 4, 2025
- ⭐28 · Updated May 24, 2025
- [NeurIPS'23] Speculative Decoding with Big Little Decoder ⭐97 · Updated Feb 6, 2024
- Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS ⭐488 · Updated Jan 20, 2026
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads ⭐524 · Updated Feb 10, 2025
- [ICLR 2025] SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration ⭐62 · Updated Feb 21, 2025
- Transformer-related optimization, including BERT and GPT ⭐6,394 · Updated Mar 27, 2024
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving ⭐336 · Updated Jul 2, 2024
- [ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models ⭐1,612 · Updated Jul 12, 2024
- [ICLR 2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ⭐144 · Updated Dec 4, 2024
- Disaggregated serving system for Large Language Models (LLMs) ⭐777 · Updated Apr 6, 2025
- [ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache ⭐358 · Updated Nov 20, 2025
- ⭐352 · Updated Apr 2, 2024
- 📰 Must-read papers on KV Cache Compression (constantly updating 🤗) ⭐661 · Updated this week
- ⭐64 · Updated Dec 3, 2024
- An Easy-to-use, Scalable and High-performance Agentic RL Framework based on Ray (PPO & DAPO & REINFORCE++ & TIS & vLLM & Ray & Async RL) ⭐9,037 · Updated Feb 21, 2026
- Quantized Attention on GPU ⭐44 · Updated Nov 22, 2024
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference ⭐374 · Updated Jul 10, 2025