hemingkx / SpeculativeDecodingPapers
📰 Must-read papers and blogs on Speculative Decoding ⚡️
☆988 · Updated this week
Alternatives and similar repositories for SpeculativeDecodingPapers
Users interested in SpeculativeDecodingPapers are comparing it to the libraries listed below.
- Fast inference from large language models via speculative decoding (the core draft-and-verify loop is sketched below this list) ☆841 · Updated last year
- Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings) ☆320 · Updated 6 months ago
- 📰 Must-read papers on KV Cache Compression (constantly updating 🤗). ☆566 · Updated 3 weeks ago
- [TMLR 2024] Efficient Large Language Models: A Survey ☆1,221 · Updated 4 months ago
- Awesome LLM compression research papers and tools. ☆1,690 · Updated 3 months ago
- Awesome-LLM-KV-Cache: A curated list of Awesome LLM KV Cache Papers with Codes. ☆376 · Updated 7 months ago
- [NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. ☆482 · Updated last year
- A curated list for Efficient Large Language Models ☆1,874 · Updated 4 months ago
- Analyze the inference of Large Language Models (LLMs). Analyze aspects like computation, storage, transmission, and hardware roofline mod… ☆567 · Updated last year
- Awesome list for LLM pruning. ☆267 · Updated 2 weeks ago
- ☆343 · Updated last year
- ☆609 · Updated 5 months ago
- Curated collection of papers in MoE model inference ☆285 · Updated last month
- Ring attention implementation with flash attention ☆903 · Updated last month
- Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3 (NeurIPS'25). ☆1,884 · Updated last week
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ☆770 · Updated 7 months ago
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference ☆338 · Updated 3 months ago
- Awesome list for LLM quantization ☆326 · Updated 2 weeks ago
- Train speculative decoding models effortlessly and port them smoothly to SGLang serving. ☆439 · Updated this week
- ☆284 · Updated 3 months ago
- [NeurIPS 2023] LLM-Pruner: On the Structural Pruning of Large Language Models. Supports Llama-3/3.1, Llama-2, LLaMA, BLOOM, Vicuna, Baich… ☆1,072 · Updated last year
- Official Implementation of "Learning Harmonized Representations for Speculative Sampling" (HASS) ☆47 · Updated 7 months ago
- Disaggregated serving system for Large Language Models (LLMs). ☆709 · Updated 6 months ago
- Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding** ☆205 · Updated 8 months ago
- [ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache ☆329 · Updated last month
- Explorations into some recent techniques surrounding speculative decoding ☆288 · Updated 10 months ago
- This repository serves as a comprehensive survey of LLM development, featuring numerous research papers along with their corresponding co… ☆221 · Updated 2 months ago
- Paper list for Efficient Reasoning. ☆703 · Updated this week
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference ☆582 · Updated last week
- [NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up Long-context LLMs' inference, approximate and dynamic sparse calculate the attention… ☆1,141 · Updated 3 weeks ago
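
For orientation, below is a minimal sketch of the draft-then-verify loop that the speculative decoding entries above build on (Leviathan et al., 2023; Chen et al., 2023). It is a toy illustration under stated assumptions, not the API of any listed repository: `draft_model` and `target_model` are hypothetical callables returning next-token probability vectors, and all names and signatures are placeholders.

```python
import torch

def speculative_decode_step(draft_model, target_model, prefix, k=4):
    """One draft-then-verify round of speculative decoding.

    `draft_model(tokens)` and `target_model(tokens, drafted)` are assumed
    (hypothetical) callables returning next-token probability vectors of
    shape (vocab,). The acceptance rule is the standard rejection-sampling
    scheme, which keeps the output distribution identical to sampling from
    the target model alone (i.e., the speedup is lossless).
    """
    # 1. The cheap draft model proposes k tokens autoregressively.
    drafted, draft_probs, seq = [], [], list(prefix)
    for _ in range(k):
        p_draft = draft_model(seq)                     # (vocab,) probabilities
        tok = torch.multinomial(p_draft, 1).item()
        drafted.append(tok)
        draft_probs.append(p_draft)
        seq.append(tok)

    # 2. The target model scores all k drafted positions (plus one extra)
    #    in a single forward pass -- this is where the speedup comes from.
    target_probs = target_model(prefix, drafted)       # k + 1 vectors

    # 3. Verify each draft: accept with probability min(1, p_target / p_draft).
    accepted = []
    for i, tok in enumerate(drafted):
        q, p = draft_probs[i][tok], target_probs[i][tok]
        if torch.rand(()).item() < min(1.0, (p / q).item()):
            accepted.append(tok)                       # token accepted
        else:
            # Rejected: resample from the residual max(p_target - p_draft, 0),
            # then stop this round (later drafts depended on the bad token).
            residual = torch.clamp(target_probs[i] - draft_probs[i], min=0.0)
            fixed = torch.multinomial(residual / residual.sum(), 1).item()
            return list(prefix) + accepted + [fixed]

    # All k drafts accepted: take one free bonus token from the target.
    bonus = torch.multinomial(target_probs[k], 1).item()
    return list(prefix) + accepted + [bonus]
```

The repositories in this list (e.g., EAGLE, HASS, Draft & Verify, the SGLang training pipeline) differ mainly in how the drafts are produced, whether by a trained lightweight head, the target model's own shallow layers, or token trees, while this verification rule stays essentially the same.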