feifeibear / LLMSpeculativeSampling
Fast inference from large language models via speculative decoding
⭐872 · Updated last year
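The tagline refers to speculative decoding: a small draft model proposes a few tokens cheaply, and the large target model verifies them in one pass, accepting each draft token with probability min(1, p(x)/q(x)), where p is the target's probability and q the draft's. A minimal sketch of just that acceptance rule, with toy probability functions standing in for real models (`speculative_sample`, `target_prob`, and `draft_prob` are illustrative names, not this repo's API):

```python
import random

def speculative_sample(target_prob, draft_prob, draft_tokens, rng=random.Random(0)):
    """Accept or reject draft tokens using the speculative sampling rule:
    accept token x with probability min(1, p(x) / q(x)).

    target_prob(x) -> p(x): probability under the large target model.
    draft_prob(x)  -> q(x): probability under the small draft model.
    Stops at the first rejection; the full algorithm would then resample
    that position from the adjusted distribution max(0, p - q), normalized.
    """
    accepted = []
    for x in draft_tokens:
        p, q = target_prob(x), draft_prob(x)
        if rng.random() < min(1.0, p / q):
            accepted.append(x)
        else:
            break  # first rejection ends the accepted prefix
    return accepted

# Toy usage: target always agrees strongly with the draft, so every
# proposed token is accepted.
print(speculative_sample(lambda x: 1.0, lambda x: 0.5, [5, 7, 9]))
```

Because the acceptance test is exact, the combined scheme samples from the target model's distribution while paying the target's cost only once per verified block.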
Alternatives and similar repositories for LLMSpeculativeSampling
Users interested in LLMSpeculativeSampling are comparing it to the libraries listed below.
- 📰 Must-read papers and blogs on Speculative Decoding ⭐1,061 · Updated 2 weeks ago
- Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings) ⭐344 · Updated 8 months ago
- [NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models ⭐494 · Updated last year
- LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training (EMNLP 2024) ⭐1,004 · Updated last year
- [TMLR 2024] Efficient Large Language Models: A Survey ⭐1,240 · Updated 6 months ago
- LongBench v2 and LongBench (ACL '25 & '24) ⭐1,050 · Updated 11 months ago
- Ring attention implementation with flash attention ⭐949 · Updated 3 months ago
- ⭐351 · Updated last year
- Analyze the inference of Large Language Models (LLMs): computation, storage, transmission, and hardware roofline mod… ⭐599 · Updated last year
- 📰 Must-read papers on KV Cache Compression (constantly updating 🤗) ⭐627 · Updated 2 months ago
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference ⭐618 · Updated this week
- Best practice for training LLaMA models in Megatron-LM ⭐664 · Updated last year
- [ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding ⭐1,310 · Updated 9 months ago
- [NeurIPS 2023] LLM-Pruner: On the Structural Pruning of Large Language Models. Supports Llama-3/3.1, Llama-2, LLaMA, BLOOM, Vicuna, Baich… ⭐1,089 · Updated last year
- Train speculative decoding models effortlessly and port them smoothly to SGLang serving ⭐577 · Updated this week
- [NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up long-context LLM inference, approximate and dynamic sparse calculation of the attention… ⭐1,169 · Updated 2 months ago
- Disaggregated serving system for Large Language Models (LLMs) ⭐754 · Updated 8 months ago
- Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3 (NeurIPS'25) ⭐2,078 · Updated last week
- The official repo of Pai-Megatron-Patch for LLM & VLM large-scale training developed by Alibaba Cloud ⭐1,489 · Updated last week
- [ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning ⭐635 · Updated last year
- Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding** ⭐213 · Updated 10 months ago
- [EMNLP 2024 & AAAI 2026] A powerful toolkit for compressing large models including LLM, VLM, and video generation models ⭐643 · Updated last month
- ⭐296 · Updated 5 months ago
- Awesome-LLM-KV-Cache: A curated list of Awesome LLM KV Cache Papers with Codes ⭐404 · Updated 9 months ago
- An Efficient "Factory" to Build Multiple LoRA Adapters ⭐361 · Updated 10 months ago
- A flexible and efficient training framework for large-scale alignment tasks ⭐446 · Updated 2 months ago
- Awesome list for LLM pruning ⭐279 · Updated 2 months ago
- ⭐612 · Updated 7 months ago
- Explorations into some recent techniques surrounding speculative decoding ⭐295 · Updated last year
- Official Implementation of "Learning Harmonized Representations for Speculative Sampling" (HASS) ⭐52 · Updated 9 months ago