feifeibear / LLMSpeculativeSampling
Fast inference from large language models via speculative decoding
★829 · Updated last year
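For context on what this repository implements, here is a minimal, self-contained sketch of the draft-then-verify loop at the heart of speculative decoding (Leviathan et al., 2023; Chen et al., 2023). The `draft_dist`/`target_dist` functions, the toy vocabulary, and the `gamma` draft length are illustrative stand-ins, not this repo's API; a real implementation would query a small draft LM and a large target LM.

```python
# Minimal sketch of speculative sampling (an assumed reimplementation,
# not this repository's actual code).
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8  # toy vocabulary size (hypothetical)


def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()


def draft_dist(prefix):
    # Hypothetical small "draft" model: a deterministic toy
    # distribution that depends only on the prefix length.
    return softmax(np.sin(np.arange(VOCAB) + len(prefix)))


def target_dist(prefix):
    # Hypothetical large "target" model, deliberately different.
    return softmax(np.cos(1.3 * np.arange(VOCAB) + len(prefix)))


def speculative_step(prefix, gamma=4):
    """One decoding step: draft gamma tokens cheaply, then verify
    them against the target model's distribution."""
    # 1) Draft model proposes gamma tokens autoregressively.
    ctx, drafted, q_probs = list(prefix), [], []
    for _ in range(gamma):
        q = draft_dist(ctx)
        tok = int(rng.choice(VOCAB, p=q))
        drafted.append(tok)
        q_probs.append(q)
        ctx.append(tok)

    # 2) Verify: accept token x with probability min(1, p(x)/q(x)).
    #    (In a real system the target scores all drafted positions
    #    in one batched forward pass.)
    out = list(prefix)
    for tok, q in zip(drafted, q_probs):
        p = target_dist(out)
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
        else:
            # On the first rejection, resample from the normalized
            # residual max(0, p - q) and stop.
            residual = np.maximum(p - q, 0.0)
            out.append(int(rng.choice(VOCAB, p=residual / residual.sum())))
            return out

    # 3) All gamma drafts accepted: take one free token from the target.
    out.append(int(rng.choice(VOCAB, p=target_dist(out))))
    return out


print(speculative_step([1, 2, 3]))
```

The residual resampling on rejection is what makes the scheme lossless: the accepted-plus-resampled tokens are distributed exactly as if every token had been sampled from the target model alone, so the draft model only affects speed, not output quality.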
Alternatives and similar repositories for LLMSpeculativeSampling
Users interested in LLMSpeculativeSampling are comparing it to the libraries listed below.
- 📰 Must-read papers and blogs on Speculative Decoding ⚡️ · ★958 · Updated 2 weeks ago
- Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings) · ★314 · Updated 5 months ago
- [NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. · ★476 · Updated last year
- LongBench v2 and LongBench (ACL '25 & '24) · ★977 · Updated 8 months ago
- Analyze the inference of Large Language Models (LLMs). Analyze aspects like computation, storage, transmission, and hardware roofline mod… · ★559 · Updated last year
- Ring attention implementation with flash attention · ★885 · Updated 3 weeks ago
- 📰 Must-read papers on KV Cache Compression (constantly updating 🤗). · ★546 · Updated this week
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference · ★571 · Updated 2 weeks ago
- ⛷️ LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training (EMNLP 2024) · ★992 · Updated 9 months ago
- ★338 · Updated last year
- [TMLR 2024] Efficient Large Language Models: A Survey · ★1,219 · Updated 3 months ago
- [NeurIPS 2023] LLM-Pruner: On the Structural Pruning of Large Language Models. Support Llama-3/3.1, Llama-2, LLaMA, BLOOM, Vicuna, Baich… · ★1,066 · Updated 11 months ago
- Best practice for training LLaMA models in Megatron-LM · ★661 · Updated last year
- [ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding · ★1,282 · Updated 6 months ago
- Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3 (NeurIPS'25). · ★1,856 · Updated last week
- Train speculative decoding models effortlessly and port them smoothly to SGLang serving. · ★412 · Updated this week
- Awesome list for LLM pruning. · ★263 · Updated this week
- [ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning · ★631 · Updated last year
- Official Implementation of "Learning Harmonized Representations for Speculative Sampling" (HASS) · ★46 · Updated 6 months ago
- Disaggregated serving system for Large Language Models (LLMs). · ★697 · Updated 5 months ago
- [NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up long-context LLMs' inference, approximate and dynamic sparse calculation of the attention… · ★1,133 · Updated this week
- Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding** · ★202 · Updated 7 months ago
- A powerful toolkit for compressing large models including LLM, VLM, and video generation models. · ★576 · Updated last month
- A repository sharing the literature on long-context large language models, including the methodologies and the evaluation benchmarks · ★268 · Updated last year
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… · ★760 · Updated 7 months ago
- The official repo of Pai-Megatron-Patch for LLM & VLM large scale training developed by Alibaba Cloud. · ★1,361 · Updated this week
- ★608 · Updated 4 months ago
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference · ★339 · Updated 2 months ago
- An Efficient "Factory" to Build Multiple LoRA Adapters · ★345 · Updated 7 months ago
- A flexible and efficient training framework for large-scale alignment tasks · ★428 · Updated this week