hemingkx / SpeculativeDecodingPapers
📰 Must-read papers and blogs on Speculative Decoding ⚡️
☆714 · Updated last week
Alternatives and similar repositories for SpeculativeDecodingPapers
Users interested in SpeculativeDecodingPapers are comparing it to the libraries listed below.
- Fast inference from large language models via speculative decoding ☆723 · Updated 8 months ago
- Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3. ☆1,220 · Updated last week
- Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings) ☆263 · Updated 3 weeks ago
- 📰 Must-read papers on KV Cache Compression (constantly updating 🤗). ☆400 · Updated last week
- [NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. ☆444 · Updated 9 months ago
- Awesome-LLM-KV-Cache: A curated list of Awesome LLM KV Cache Papers with Codes. ☆292 · Updated 2 months ago
- [TMLR 2024] Efficient Large Language Models: A Survey ☆1,149 · Updated last month
- Explorations into some recent techniques surrounding speculative decoding ☆262 · Updated 4 months ago
- ☆319 · Updated last year
- Ring attention implementation with flash attention ☆759 · Updated last month
- ☆578 · Updated 2 months ago
- Analyze the inference of Large Language Models (LLMs). Analyze aspects like computation, storage, transmission, and hardware roofline mod… ☆454 · Updated 8 months ago
- A curated list for Efficient Large Language Models ☆1,651 · Updated 3 weeks ago
- Awesome LLM compression research papers and tools. ☆1,502 · Updated this week
- Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding** ☆185 · Updated 3 months ago
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference ☆492 · Updated 3 weeks ago
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference ☆279 · Updated 5 months ago
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ☆666 · Updated 2 months ago
- ☆241 · Updated last year
- Awesome list for LLM pruning. ☆224 · Updated 5 months ago
- [ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning ☆608 · Updated last year
- Disaggregated serving system for Large Language Models (LLMs). ☆580 · Updated last month
- [ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding ☆1,246 · Updated 2 months ago
- [ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache ☆295 · Updated 3 months ago
- REST: Retrieval-Based Speculative Decoding, NAACL 2024 ☆201 · Updated 5 months ago
- Paper list for Efficient Reasoning. ☆425 · Updated this week
- [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization ☆348 · Updated 9 months ago
- [NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up Long-context LLMs' inference, approximate and dynamic sparse calculate the attention… ☆1,013 · Updated last week
- 🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention" ☆653 · Updated last month
- Code for the NeurIPS 2024 paper: QuaRot, an end-to-end 4-bit inference of large language models. ☆384 · Updated 5 months ago