hemingkx / SpeculativeDecodingPapers
📰 Must-read papers and blogs on Speculative Decoding ⚡️
☆800 · Updated last week
Alternatives and similar repositories for SpeculativeDecodingPapers
Users interested in SpeculativeDecodingPapers are comparing it to the libraries listed below.
- Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings) ☆282 · Updated 2 months ago
- Fast inference from large language models via speculative decoding (see the toy accept/reject sketch after this list) ☆762 · Updated 10 months ago
- Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3. ☆1,325 · Updated 2 weeks ago
- 📰 Must-read papers on KV Cache Compression (constantly updating). ☆459 · Updated this week
- [TMLR 2024] Efficient Large Language Models: A Survey ☆1,172 · Updated this week
- [NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. ☆454 · Updated 10 months ago
- ☆595 · Updated last month
- Awesome LLM compression research papers and tools. ☆1,567 · Updated last week
- Awesome-LLM-KV-Cache: A curated list of Awesome LLM KV Cache Papers with Codes. ☆321 · Updated 3 months ago
- A curated list for Efficient Large Language Models ☆1,736 · Updated last week
- Analyze the inference of Large Language Models (LLMs). Analyze aspects like computation, storage, transmission, and hardware roofline mod… ☆487 · Updated 9 months ago
- ☆328 · Updated last year
- Ring attention implementation with flash attention ☆789 · Updated last week
- Awesome list for LLM pruning. ☆232 · Updated 6 months ago
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference ☆519 · Updated 3 weeks ago
- Explorations into some recent techniques surrounding speculative decoding ☆269 · Updated 6 months ago
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ☆705 · Updated 3 months ago
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference ☆297 · Updated 7 months ago
- [ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding ☆1,258 · Updated 3 months ago
- [ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache ☆303 · Updated 5 months ago
- [NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up Long-context LLMs' inference, approximate and dynamic sparse calculate the attention… ☆1,055 · Updated last week
- paper and its code for AI System ☆311 · Updated 2 months ago
- ⛷️ LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training (EMNLP 2024) ☆967 · Updated 6 months ago
- ☆256 · Updated last year
- Puzzles for learning Triton, play it with minimal environment configuration! ☆367 · Updated 6 months ago
- Paper list for Efficient Reasoning. ☆509 · Updated this week
- Disaggregated serving system for Large Language Models (LLMs). ☆617 · Updated 2 months ago
- Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding** ☆190 · Updated 4 months ago
- [ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning ☆617 · Updated last year
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens. ☆846 · Updated 9 months ago
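
For context on the technique these repositories center on, here is a minimal toy sketch of the speculative decoding accept/reject rule (draft-then-verify sampling as described in the speculative decoding papers). The draft and target "models" are hypothetical lookup tables over a tiny vocabulary, invented purely so the loop runs end to end; this is not the implementation from any repository listed above.

```python
# Toy sketch of the speculative decoding accept/reject rule.
# The "models" below are hypothetical lookup tables over a tiny vocabulary,
# used only for illustration; not from any repository listed above.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, GAMMA = 8, 4          # toy vocab size, number of draft tokens per step

# Stand-ins for a cheap draft model q(.) and an expensive target model p(.):
# each maps a prefix to a categorical distribution over VOCAB tokens.
DRAFT_TABLE = rng.random((16, VOCAB))
TARGET_TABLE = rng.random((16, VOCAB))

def dist(table, prefix):
    row = table[len(prefix) % 16]
    return row / row.sum()

def speculative_step(prefix):
    """One decoding step: draft GAMMA tokens cheaply, then verify with the target."""
    # 1) The draft model proposes GAMMA tokens autoregressively.
    ctx, draft_tokens, q_dists = list(prefix), [], []
    for _ in range(GAMMA):
        q = dist(DRAFT_TABLE, ctx)
        tok = int(rng.choice(VOCAB, p=q))
        draft_tokens.append(tok)
        q_dists.append(q)
        ctx.append(tok)

    # 2) The target model verifies every drafted position (in a real system this
    #    is one parallel forward pass, which is where the speedup comes from).
    out = list(prefix)
    for i, tok in enumerate(draft_tokens):
        p = dist(TARGET_TABLE, list(prefix) + draft_tokens[:i])
        q = q_dists[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):   # accept w.p. min(1, p/q)
            out.append(tok)
        else:
            # Reject: resample from the normalized residual max(0, p - q) and stop.
            residual = np.maximum(p - q, 0.0)
            out.append(int(rng.choice(VOCAB, p=residual / residual.sum())))
            return out

    # 3) All drafts accepted: sample one bonus token from the target distribution.
    out.append(int(rng.choice(VOCAB, p=dist(TARGET_TABLE, out))))
    return out

print(speculative_step([0, 1]))   # prefix [0, 1] extended by up to GAMMA + 1 tokens
```

If all GAMMA draft tokens are accepted, the target model contributes one extra token in the same step; on the first rejection, resampling from the normalized residual max(0, p − q) keeps the output distribution identical to sampling from the target model alone.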