hemingkx / SpeculativeDecodingPapers
📰 Must-read papers and blogs on Speculative Decoding ⚡️
⭐1,040 · Updated last week
Alternatives and similar repositories for SpeculativeDecodingPapers
Users that are interested in SpeculativeDecodingPapers are comparing it to the libraries listed below
- Fast inference from large language models via speculative decoding ⭐860 · Updated last year
- Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings) ⭐338 · Updated 7 months ago
- 📰 Must-read papers on KV Cache Compression (constantly updating 🤗). ⭐614 · Updated 2 months ago
- [TMLR 2024] Efficient Large Language Models: A Survey ⭐1,236 · Updated 5 months ago
- Awesome-LLM-KV-Cache: A curated list of Awesome LLM KV Cache Papers with Codes. ⭐393 · Updated 9 months ago
- Awesome LLM compression research papers and tools. ⭐1,725 · Updated 3 weeks ago
- A curated list for Efficient Large Language Models ⭐1,910 · Updated 5 months ago
- [NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. ⭐487 · Updated last year
- Awesome list for LLM pruning. ⭐276 · Updated last month
- Analyze the inference of Large Language Models (LLMs). Analyze aspects like computation, storage, transmission, and hardware roofline mod… ⭐589 · Updated last year
- ⭐610 · Updated 6 months ago
- ⭐348 · Updated last year
- Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3 (NeurIPS'25). ⭐2,035 · Updated last week
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ⭐790 · Updated 9 months ago
- Disaggregated serving system for Large Language Models (LLMs). ⭐737 · Updated 8 months ago
- Train speculative decoding models effortlessly and port them smoothly to SGLang serving. ⭐523 · Updated this week
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference ⭐356 · Updated 4 months ago
- Official Implementation of "Learning Harmonized Representations for Speculative Sampling" (HASS) ⭐51 · Updated 8 months ago
- Curated collection of papers in MoE model inference ⭐308 · Updated last month
- This repository serves as a comprehensive survey of LLM development, featuring numerous research papers along with their corresponding co… ⭐255 · Updated 4 months ago
- Explorations into some recent techniques surrounding speculative decoding ⭐295 · Updated 11 months ago
- [NeurIPS 2023] LLM-Pruner: On the Structural Pruning of Large Language Models. Support Llama-3/3.1, Llama-2, LLaMA, BLOOM, Vicuna, Baich… ⭐1,083 · Updated last year
- ⭐290 · Updated 4 months ago
- [ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache ⭐339 · Updated 2 weeks ago
- Awesome list for LLM quantization ⭐365 · Updated last month
- Ring attention implementation with flash attention ⭐923 · Updated 2 months ago
- 📰 Must-read papers and blogs on LLM based Long Context Modeling 🔥 ⭐1,837 · Updated last week
- Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding** ⭐212 · Updated 9 months ago
- Implementation of the paper Fast Inference from Transformers via Speculative Decoding, Leviathan et al. 2023 (the core draft-and-verify loop is sketched below this list). ⭐90 · Updated last year
- Paper list for Efficient Reasoning. ⭐746 · Updated this week
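
Several entries above (the Leviathan et al. implementation, EAGLE, HASS, Draft & Verify) share one core loop: a cheap draft model proposes tokens and the target model verifies them with rejection sampling. A minimal sketch of that loop, assuming two HuggingFace-style causal LMs whose forward pass returns `.logits` (`draft_model` and `target_model` are hypothetical stand-ins, not APIs from the repos listed):

```python
# Minimal sketch of speculative decoding (Leviathan et al., 2023).
# Assumption: both models are callables on a [1, seq_len] LongTensor of
# token ids and return an object with .logits of shape [1, seq_len, vocab].
import torch

def speculative_step(target_model, draft_model, prefix, k=4):
    """Propose k draft tokens, verify them with one target forward pass,
    and return the extended 1-D tensor of token ids."""
    # 1) Draft phase: sample k tokens autoregressively from the cheap model.
    draft_tokens, draft_probs = [], []
    ctx = prefix.clone()
    for _ in range(k):
        logits = draft_model(ctx.unsqueeze(0)).logits[0, -1]
        p = torch.softmax(logits, dim=-1)
        t = torch.multinomial(p, 1)
        draft_tokens.append(t.item())
        draft_probs.append(p)
        ctx = torch.cat([ctx, t])

    # 2) Verify phase: a single target pass scores all k draft positions.
    logits = target_model(ctx.unsqueeze(0)).logits[0]
    n = prefix.numel()
    target_probs = torch.softmax(logits[n - 1 : n - 1 + k], dim=-1)

    # 3) Rejection sampling: accept draft token i with prob min(1, q_i/p_i),
    # which makes the output distribution identical to target-only sampling.
    accepted = prefix
    for i, t in enumerate(draft_tokens):
        q, p = target_probs[i][t], draft_probs[i][t]
        if torch.rand(()) < torch.clamp(q / p, max=1.0):
            accepted = torch.cat([accepted, torch.tensor([t])])
        else:
            # On rejection, resample from the renormalized residual
            # max(0, q - p); its mass is positive whenever q(t) < p(t).
            residual = torch.clamp(target_probs[i] - draft_probs[i], min=0)
            t_new = torch.multinomial(residual / residual.sum(), 1)
            return torch.cat([accepted, t_new])

    # All k accepted: take one free "bonus" token from the target's
    # prediction at the position after the last draft token.
    bonus = torch.multinomial(torch.softmax(logits[n - 1 + k], dim=-1), 1)
    return torch.cat([accepted, bonus])
```

Systems in the list refine this basic loop rather than replace it: EAGLE and HASS train lightweight draft heads on top of the target model's features, and Spec-Bench benchmarks such variants under a unified harness.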