hemingkx / SpeculativeDecodingPapers
📰 Must-read papers and blogs on Speculative Decoding ⚡️
⭐854 · Updated this week
Alternatives and similar repositories for SpeculativeDecodingPapers
Users interested in SpeculativeDecodingPapers are comparing it to the libraries listed below
- Fast inference from large language models via speculative decoding ⭐791 · Updated 11 months ago
- Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings) ⭐299 · Updated 3 months ago
- 📰 Must-read papers on KV Cache Compression (constantly updating 🤗). ⭐500 · Updated this week
- [TMLR 2024] Efficient Large Language Models: A Survey ⭐1,197 · Updated last month
- Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3. ⭐1,439 · Updated last week
- Awesome-LLM-KV-Cache: A curated list of Awesome LLM KV Cache Papers with Codes. ⭐343 · Updated 5 months ago
- A curated list for Efficient Large Language Models ⭐1,802 · Updated last month
- Analyze the inference of Large Language Models (LLMs). Analyze aspects like computation, storage, transmission, and hardware roofline mod… ⭐521 · Updated 10 months ago
- Awesome LLM compression research papers and tools. ⭐1,618 · Updated last month
- [NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. ⭐462 · Updated last year
- Awesome list for LLM pruning. ⭐246 · Updated 7 months ago
- ⭐605 · Updated 2 months ago
- ⭐331 · Updated last year
- Disaggregated serving system for Large Language Models (LLMs). ⭐654 · Updated 3 months ago
- Official Implementation of "Learning Harmonized Representations for Speculative Sampling" (HASS) ⭐43 · Updated 4 months ago
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ⭐730 · Updated 4 months ago
- This repository serves as a comprehensive survey of LLM development, featuring numerous research papers along with their corresponding co… ⭐172 · Updated this week
- Ring attention implementation with flash attention ⭐828 · Updated last week
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference ⭐311 · Updated 3 weeks ago
- Code associated with the paper "Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding" ⭐198 · Updated 5 months ago
- Curated collection of papers in MoE model inference ⭐220 · Updated this week
- Awesome list for LLM quantization ⭐260 · Updated last month
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference ⭐537 · Updated 2 weeks ago
- ⭐23 · Updated 4 months ago
- Paper list for Efficient Reasoning. ⭐573 · Updated this week
- Explorations into some recent techniques surrounding speculative decoding ⭐275 · Updated 7 months ago
- ⭐268 · Updated 3 weeks ago
- [ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache ⭐312 · Updated 6 months ago
- Survey Paper List - Efficient LLM and Foundation Models ⭐253 · Updated 10 months ago
- slime is an LLM post-training framework aiming for RL Scaling. ⭐975 · Updated this week