hemingkx / SpeculativeDecodingPapers
📰 Must-read papers and blogs on Speculative Decoding ⚡️
★755 · Updated last week
Alternatives and similar repositories for SpeculativeDecodingPapers
Users interested in SpeculativeDecodingPapers are comparing it to the libraries listed below.
- Fast inference from large language models via speculative decoding · ★745 · Updated 9 months ago
- Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings) · ★269 · Updated last month
- Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3. · ★1,277 · Updated this week
- 📰 Must-read papers on KV Cache Compression (constantly updating 🤗). · ★431 · Updated last week
- ★587 · Updated 3 weeks ago
- Analyze the inference of Large Language Models (LLMs). Analyze aspects like computation, storage, transmission, and hardware roofline mod… · ★471 · Updated 8 months ago
- [NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. · ★448 · Updated 10 months ago
- ★322 · Updated last year
- [TMLR 2024] Efficient Large Language Models: A Survey · ★1,161 · Updated 2 months ago
- Ring attention implementation with flash attention · ★771 · Updated 2 weeks ago
- Awesome-LLM-KV-Cache: A curated list of Awesome LLM KV Cache Papers with Codes. · ★304 · Updated 3 months ago
- A curated list for Efficient Large Language Models · ★1,694 · Updated last month
- Explorations into some recent techniques surrounding speculative decoding · ★266 · Updated 5 months ago
- Awesome LLM compression research papers and tools. · ★1,539 · Updated last week
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… · ★686 · Updated 2 months ago
- Awesome list for LLM pruning. · ★230 · Updated 5 months ago
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference · ★506 · Updated last week
- Disaggregated serving system for Large Language Models (LLMs). · ★601 · Updated last month
- Latency and Memory Analysis of Transformer Models for Training and Inference · ★421 · Updated last month
- Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding** · ★186 · Updated 3 months ago
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference · ★291 · Updated 6 months ago
- Code for NeurIPS'24 paper: QuaRot, an end-to-end 4-bit inference of large language models. · ★390 · Updated 6 months ago
- [ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning · ★611 · Updated last year
- [ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding · ★1,249 · Updated 2 months ago
- Dynamic Memory Management for Serving LLMs without PagedAttention · ★384 · Updated this week
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens. · ★831 · Updated 9 months ago
- A PyTorch Native LLM Training Framework · ★811 · Updated 5 months ago
- [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization · ★355 · Updated 9 months ago
- ★248 · Updated last year
- A throughput-oriented high-performance serving framework for LLMs · ★814 · Updated 3 weeks ago