hemingkx / SpeculativeDecodingPapers
📰 Must-read papers and blogs on Speculative Decoding ⚡️
☆917 · Updated last week
Alternatives and similar repositories for SpeculativeDecodingPapers
Users interested in SpeculativeDecodingPapers are comparing it to the libraries listed below.
- Fast inference from large language models via speculative decoding ☆815 · Updated last year
- Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings) ☆310 · Updated 4 months ago
- 📰 Must-read papers on KV Cache Compression (constantly updating 🤗). ☆531 · Updated last month
- [TMLR 2024] Efficient Large Language Models: A Survey ☆1,212 · Updated 2 months ago
- Awesome-LLM-KV-Cache: A curated list of Awesome LLM KV Cache Papers with Codes. ☆358 · Updated 6 months ago
- A curated list for Efficient Large Language Models ☆1,860 · Updated 2 months ago
- [NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. ☆473 · Updated last year
- Awesome LLM compression research papers and tools. ☆1,658 · Updated 2 months ago
- ☆609 · Updated 4 months ago
- Awesome list for LLM pruning. ☆257 · Updated this week
- Analyze the inference of Large Language Models (LLMs). Analyze aspects like computation, storage, transmission, and hardware roofline mod… ☆547 · Updated last year
- Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3. ☆1,635 · Updated this week
- ☆335 · Updated last year
- Official Implementation of "Learning Harmonized Representations for Speculative Sampling" (HASS) ☆45 · Updated 6 months ago
- Ring attention implementation with flash attention ☆864 · Updated last month
- This repository serves as a comprehensive survey of LLM development, featuring numerous research papers along with their corresponding co… ☆196 · Updated last month
- Awesome list for LLM quantization ☆297 · Updated last week
- Curated collection of papers in MoE model inference ☆255 · Updated last month
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference ☆333 · Updated 2 months ago
- Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding** ☆201 · Updated 7 months ago
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ☆753 · Updated 6 months ago
- Train speculative decoding models effortlessly and port them smoothly to SGLang serving. ☆374 · Updated this week
- Disaggregated serving system for Large Language Models (LLMs). ☆685 · Updated 5 months ago
- [ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache ☆323 · Updated 7 months ago
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference ☆557 · Updated last month
- Explorations into some recent techniques surrounding speculative decoding ☆285 · Updated 8 months ago
- ☆25 · Updated 5 months ago
- ☆278 · Updated 2 months ago
- Latency and Memory Analysis of Transformer Models for Training and Inference ☆450 · Updated 4 months ago
- A curated reading list of research in Mixture-of-Experts (MoE). ☆642 · Updated 10 months ago
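
Since nearly every repository above builds on the same draft-then-verify loop, a minimal sketch of vanilla speculative sampling may help orient readers before diving in. This is an illustrative toy under assumptions of my own, not code from any repository listed here: `draft_probs`, `target_probs`, `VOCAB`, and `gamma` are hypothetical stand-ins, with random categorical distributions in place of real draft and target LLMs.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8  # toy vocabulary size (hypothetical)

def draft_probs(prefix):
    """Stand-in for a small draft model: a deterministic toy distribution."""
    local = np.random.default_rng(hash(tuple(prefix)) % (2**32))
    return local.dirichlet(np.ones(VOCAB))

def target_probs(prefix):
    """Stand-in for the large target model (called once per position here;
    real systems verify all drafted positions in a single batched pass)."""
    local = np.random.default_rng((hash(tuple(prefix)) + 1) % (2**32))
    return local.dirichlet(np.ones(VOCAB))

def speculative_step(prefix, gamma=4):
    """One draft-then-verify round in the style of Leviathan et al. (2023)."""
    # 1) Draft gamma tokens autoregressively with the cheap model.
    ctx, drafted, q = list(prefix), [], []
    for _ in range(gamma):
        p_draft = draft_probs(ctx)
        tok = int(rng.choice(VOCAB, p=p_draft))
        drafted.append(tok)
        q.append(p_draft)
        ctx.append(tok)
    # 2) Verify left to right: accept token t with prob min(1, p(t)/q(t)).
    ctx, accepted = list(prefix), []
    for tok, q_i in zip(drafted, q):
        p_i = target_probs(ctx)
        if rng.random() < min(1.0, p_i[tok] / q_i[tok]):
            accepted.append(tok)
            ctx.append(tok)
        else:
            # 3) On rejection, resample from the normalized residual
            #    max(p - q, 0) and stop this round.
            residual = np.maximum(p_i - q_i, 0.0)
            accepted.append(int(rng.choice(VOCAB, p=residual / residual.sum())))
            return accepted
    # 4) Every draft accepted: take one free "bonus" token from the target.
    accepted.append(int(rng.choice(VOCAB, p=target_probs(ctx))))
    return accepted

print(speculative_step([1, 2, 3]))  # yields up to gamma + 1 tokens per round
```

The acceptance test `min(1, p/q)` plus residual resampling is what makes the scheme lossless: accepted tokens follow exactly the target model's distribution, which is the guarantee that projects such as EAGLE, HASS, and Draft & Verify build on.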