hemingkx / SpeculativeDecodingPapers
📰 Must-read papers and blogs on Speculative Decoding ⚡️
⭐ 1,110 · Updated last week
Alternatives and similar repositories for SpeculativeDecodingPapers
Users interested in SpeculativeDecodingPapers are comparing it to the libraries listed below.
- Fast inference from large language models via speculative decoding ⭐ 884 · Updated last year (the core accept/reject loop is sketched after this list)
- Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings) ⭐ 357 · Updated 9 months ago
- 📰 Must-read papers on KV Cache Compression (constantly updating 🤗). ⭐ 649 · Updated 4 months ago
- [TMLR 2024] Efficient Large Language Models: A Survey ⭐ 1,250 · Updated 7 months ago
- Awesome LLM compression research papers and tools. ⭐ 1,771 · Updated 2 months ago
- Awesome-LLM-KV-Cache: A curated list of Awesome LLM KV Cache Papers with Codes. ⭐ 411 · Updated 11 months ago
- [NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. ⭐ 503 · Updated last year
- A curated list for Efficient Large Language Models ⭐ 1,949 · Updated 7 months ago
- Analyze the inference of Large Language Models (LLMs). Analyze aspects like computation, storage, transmission, and hardware roofline mod… ⭐ 615 · Updated last year
- Awesome list for LLM pruning. ⭐ 281 · Updated 3 months ago
- ⭐ 353 · Updated last year
- ⭐ 626 · Updated 3 weeks ago
- Train speculative decoding models effortlessly and port them smoothly to SGLang serving. ⭐ 676 · Updated this week
- Awesome list for LLM quantization ⭐ 384 · Updated 3 months ago
- Curated collection of papers in MoE model inference ⭐ 339 · Updated 3 months ago
- Ring attention implementation with flash attention ⭐ 973 · Updated 4 months ago
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ⭐ 807 · Updated 10 months ago
- Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding** ⭐ 214 · Updated 11 months ago
- ⭐ 302 · Updated 6 months ago
- Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3 (NeurIPS'25). ⭐ 2,169 · Updated last week
- Disaggregated serving system for Large Language Models (LLMs). ⭐ 772 · Updated 9 months ago
- Explorations into some recent techniques surrounding speculative decoding ⭐ 299 · Updated last year
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference ⭐ 370 · Updated 6 months ago
- Official Implementation of "Learning Harmonized Representations for Speculative Sampling" (HASS) ⭐ 52 · Updated 10 months ago
- [NeurIPS 2023] LLM-Pruner: On the Structural Pruning of Large Language Models. Support Llama-3/3.1, Llama-2, LLaMA, BLOOM, Vicuna, Baich… ⭐ 1,105 · Updated last year
- [ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache ⭐ 357 · Updated 2 months ago
- This repository serves as a comprehensive survey of LLM development, featuring numerous research papers along with their corresponding co… ⭐ 278 · Updated 2 months ago
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference ⭐ 637 · Updated 3 weeks ago
- [EMNLP 2024 & AAAI 2026] A powerful toolkit for compressing large models including LLMs, VLMs, and video generative models. ⭐ 672 · Updated 2 months ago
- [NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up Long-context LLMs' inference, approximate and dynamic sparse calculate the attention… ⭐ 1,180 · Updated 4 months ago
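Since most entries above orbit the same core algorithm, here is a minimal, self-contained sketch of the speculative-decoding verification loop described in the "Fast inference from large language models via speculative decoding" paper listed above. `draft_probs`, `target_probs`, `gamma`, and the toy vocabulary are illustrative stand-ins invented for this sketch, not code from any of these repositories.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8  # toy vocabulary size (assumption for this sketch)

def toy_dist(scores):
    """Turn an arbitrary score vector into a probability distribution (softmax)."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

def draft_probs(ctx):
    """Toy stand-in for the small, fast draft model."""
    return toy_dist(np.cos(np.arange(VOCAB) + len(ctx)))

def target_probs(ctx):
    """Toy stand-in for the large target model."""
    return toy_dist(np.sin(np.arange(VOCAB) * 1.3 + len(ctx)))

def speculative_step(ctx, gamma=4):
    """One step: draft gamma tokens, verify against the target, return extended context."""
    # 1) The draft model proposes gamma tokens autoregressively.
    proposal, q_list = [], []
    for _ in range(gamma):
        q = draft_probs(ctx + proposal)
        proposal.append(int(rng.choice(VOCAB, p=q)))
        q_list.append(q)
    # 2) The target model scores every prefix (in practice: one batched forward pass).
    p_list = [target_probs(ctx + proposal[:i]) for i in range(gamma + 1)]
    # 3) Accept each drafted token x with probability min(1, p(x)/q(x)).
    accepted = []
    for i, tok in enumerate(proposal):
        p, q = p_list[i], q_list[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
        else:
            # Rejected: resample from the normalized residual max(0, p - q) and stop.
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()  # nonzero whenever p != q
            accepted.append(int(rng.choice(VOCAB, p=residual)))
            return ctx + accepted
    # 4) All gamma tokens accepted: take one bonus token from the target distribution.
    accepted.append(int(rng.choice(VOCAB, p=p_list[gamma])))
    return ctx + accepted

print(speculative_step([1, 2, 3]))
```

The accept/reject rule is what makes the method lossless: accepting with probability min(1, p(x)/q(x)) and resampling rejections from the normalized residual max(0, p − q) yields outputs distributed exactly as the target model alone would produce, while the target only runs one (batched) verification pass per gamma drafted tokens.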