feifeibear / LLMSpeculativeSampling
Fast inference from large language models via speculative decoding
⭐791 · Updated 11 months ago
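The repository above implements the draft-then-verify loop of speculative sampling: a small draft model proposes several tokens, and the target model accepts each with probability min(1, p/q), resampling from the residual distribution on the first rejection. Below is a minimal toy sketch of that acceptance rule; the dictionaries stand in for per-token model probabilities, and all names are illustrative, not the repo's actual API.

```python
import random

def speculative_step(target_p, draft_q, drafted, rng=random.random):
    """One verification pass of speculative sampling (toy version).

    target_p, draft_q: dicts mapping token -> probability under the
    target and draft models. Real implementations use per-position
    conditional distributions; fixed dicts keep the sketch small.
    drafted: tokens proposed by the draft model (so draft_q[tok] > 0).
    Returns the accepted tokens; the first rejection triggers one
    resample from the normalized residual max(p - q, 0) and stops.
    """
    accepted = []
    for tok in drafted:
        p, q = target_p.get(tok, 0.0), draft_q[tok]
        if rng() < min(1.0, p / q):
            accepted.append(tok)  # token kept: target agrees enough
        else:
            # Rejection: sample a replacement from max(p - q, 0), normalized.
            residual = {t: max(target_p[t] - draft_q.get(t, 0.0), 0.0)
                        for t in target_p}
            z = sum(residual.values())
            r = rng() * z
            for t, w in residual.items():
                r -= w
                if r <= 0:
                    accepted.append(t)
                    break
            break
    return accepted
```

With this rule, the sequence of emitted tokens is distributed exactly as if sampled from the target model alone, which is why the method is "lossless" while still skipping most target-model forward passes.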
Alternatives and similar repositories for LLMSpeculativeSampling
Users interested in LLMSpeculativeSampling are comparing it to the libraries listed below
- 📰 Must-read papers and blogs on Speculative Decoding ⚡️ ⭐854 · Updated this week
- Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings) ⭐299 · Updated 3 months ago
- [NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. ⭐462 · Updated last year
- Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3. ⭐1,439 · Updated last week
- LongBench v2 and LongBench (ACL '25 & '24) ⭐936 · Updated 6 months ago
- Analyze the inference of Large Language Models (LLMs). Analyze aspects like computation, storage, transmission, and hardware roofline mod… ⭐521 · Updated 10 months ago
- ⛷️ LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training (EMNLP 2024) ⭐977 · Updated 7 months ago
- ⭐331 · Updated last year
- 📰 Must-read papers on KV Cache Compression (constantly updating 🤗) ⭐500 · Updated this week
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long-Context Transformer Model Training and Inference ⭐537 · Updated 2 weeks ago
- [NeurIPS 2023] LLM-Pruner: On the Structural Pruning of Large Language Models. Support Llama-3/3.1, Llama-2, LLaMA, BLOOM, Vicuna, Baich… ⭐1,050 · Updated 9 months ago
- Ring attention implementation with flash attention ⭐828 · Updated last week
- [TMLR 2024] Efficient Large Language Models: A Survey ⭐1,197 · Updated last month
- Best practice for training LLaMA models in Megatron-LM ⭐659 · Updated last year
- [ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding ⭐1,263 · Updated 4 months ago
- Disaggregated serving system for Large Language Models (LLMs) ⭐654 · Updated 3 months ago
- [ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning ⭐626 · Updated last year
- Awesome list for LLM pruning ⭐246 · Updated 7 months ago
- slime is an LLM post-training framework aimed at RL scaling ⭐975 · Updated this week
- Code associated with the paper "Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding" ⭐198 · Updated 5 months ago
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ⭐730 · Updated 4 months ago
- [NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up long-context LLMs' inference, approximate and dynamic sparse calculation of the attention… ⭐1,080 · Updated last week
- Official Implementation of "Learning Harmonized Representations for Speculative Sampling" (HASS) ⭐43 · Updated 4 months ago
- The official repo of Pai-Megatron-Patch for LLM & VLM large-scale training, developed by Alibaba Cloud ⭐1,258 · Updated 3 weeks ago
- A curated list for Efficient Large Language Models ⭐1,802 · Updated last month
- Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models" ⭐305 · Updated 5 months ago
- A repository sharing the literature on long-context large language models, including the methodologies and the evaluation benchmarks… ⭐265 · Updated last year
- Awesome LLM compression research papers and tools ⭐1,618 · Updated last month
- Awesome list for LLM quantization ⭐260 · Updated last month
- An Efficient "Factory" to Build Multiple LoRA Adapters ⭐330 · Updated 5 months ago