hemingkx / SpeculativeDecodingPapers
📰 Must-read papers and blogs on Speculative Decoding ⚡️
★890 · Updated last week
Alternatives and similar repositories for SpeculativeDecodingPapers
Users who are interested in SpeculativeDecodingPapers are comparing it to the libraries listed below.
- Fast inference from large language models via speculative decoding (the core draft-and-verify loop is sketched after this list) ★807 · Updated last year
- Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings) ★306 · Updated 4 months ago
- 📰 Must-read papers on KV Cache Compression (constantly updating 🤗). ★517 · Updated 3 weeks ago
- Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3. ★1,490 · Updated this week
- [TMLR 2024] Efficient Large Language Models: A Survey ★1,200 · Updated 2 months ago
- Awesome-LLM-KV-Cache: A curated list of Awesome LLM KV Cache Papers with Codes. ★351 · Updated 5 months ago
- [NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. ★467 · Updated last year
- Analyze the inference of Large Language Models (LLMs). Analyze aspects like computation, storage, transmission, and hardware roofline mod… ★537 · Updated 11 months ago
- A curated list for Efficient Large Language Models ★1,844 · Updated 2 months ago
- Awesome LLM compression research papers and tools. ★1,643 · Updated last month
- Awesome list for LLM pruning. ★251 · Updated this week
- ★608 · Updated 3 months ago
- ★332 · Updated last year
- Official Implementation of "Learning Harmonized Representations for Speculative Sampling" (HASS) ★45 · Updated 5 months ago
- This repository serves as a comprehensive survey of LLM development, featuring numerous research papers along with their corresponding co… ★183 · Updated 3 weeks ago
- Disaggregated serving system for Large Language Models (LLMs). ★669 · Updated 4 months ago
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ★735 · Updated 5 months ago
- Ring attention implementation with flash attention ★841 · Updated 3 weeks ago
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference ★323 · Updated last month
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference ★549 · Updated last month
- Explorations into some recent techniques surrounding speculative decoding ★282 · Updated 8 months ago
- 📰 Must-read papers and blogs on LLM based Long Context Modeling 🔥 ★1,670 · Updated this week
- ★273 · Updated last month
- [NeurIPS 2023] LLM-Pruner: On the Structural Pruning of Large Language Models. Support Llama-3/3.1, Llama-2, LLaMA, BLOOM, Vicuna, Baich… ★1,053 · Updated 10 months ago
- Curated collection of papers in MoE model inference ★235 · Updated 3 weeks ago
- Awesome list for LLM quantization ★279 · Updated this week
- Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding** ★201 · Updated 6 months ago
- [ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache ★315 · Updated 7 months ago
- LongBench v2 and LongBench (ACL '25 & '24) ★951 · Updated 7 months ago
- [NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up Long-context LLMs' inference, approximate and dynamic sparse calculate the attention… ★1,105 · Updated 2 weeks ago
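For readers new to the topic, below is a minimal, self-contained sketch of the draft-and-verify loop that the speculative decoding repositories above study. It is a toy illustration, not code from any listed repository: the two "models" are stand-in bigram samplers, and all names (`draft_probs`, `target_probs`, `speculative_step`) are hypothetical.

```python
# Toy sketch of speculative decoding: draft k tokens with a cheap model,
# then verify them against an expensive target model. The "models" here
# are hand-written stand-in distributions, not real LLMs.
import random

VOCAB = list(range(8))  # toy vocabulary of 8 token ids

def draft_probs(prefix):
    """Cheap draft model (stand-in): mildly favors repeating the last token."""
    p = [1.0] * len(VOCAB)
    p[prefix[-1] % len(VOCAB)] += 1.0
    s = sum(p)
    return [x / s for x in p]

def target_probs(prefix):
    """Expensive target model (stand-in): favors (last token + 1) mod |V|."""
    p = [1.0] * len(VOCAB)
    p[(prefix[-1] + 1) % len(VOCAB)] += 3.0
    s = sum(p)
    return [x / s for x in p]

def sample(probs):
    return random.choices(VOCAB, weights=probs, k=1)[0]

def speculative_step(prefix, k=4):
    """One round: draft k tokens cheaply, then accept/reject against the target.

    Accepting token x with probability min(1, p_target(x)/p_draft(x)) and,
    on rejection, resampling from the normalized residual max(0, p_t - p_d)
    keeps the output distribution identical to sampling the target alone.
    """
    # 1) Draft phase: propose k tokens autoregressively with the cheap model.
    drafts, ctx = [], list(prefix)
    for _ in range(k):
        t = sample(draft_probs(ctx))
        drafts.append(t)
        ctx.append(t)

    # 2) Verify phase: accept each draft token with prob min(1, p_t/p_d).
    accepted, ctx = [], list(prefix)
    for t in drafts:
        p_t, p_d = target_probs(ctx), draft_probs(ctx)
        if random.random() < min(1.0, p_t[t] / p_d[t]):
            accepted.append(t)
            ctx.append(t)
        else:
            # Rejected: resample from the residual distribution and stop.
            residual = [max(0.0, a - b) for a, b in zip(p_t, p_d)]
            z = sum(residual)
            fix = sample([r / z for r in residual]) if z > 0 else sample(p_t)
            accepted.append(fix)
            return accepted
    # All k drafts accepted: the target grants one extra "free" token.
    accepted.append(sample(target_probs(ctx)))
    return accepted

if __name__ == "__main__":
    out = [0]
    while len(out) < 20:
        out.extend(speculative_step(out))
    print(out)
```

The accept/reject rule plus residual resampling is what makes speculative decoding lossless: the generated sequence follows the target model's distribution exactly, while each verified batch amortizes one expensive target pass over up to k+1 output tokens.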