feifeibear / LLMSpeculativeSampling
Fast inference from large language models via speculative decoding
★886 · Updated last year
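
Speculative decoding lets a small draft model propose several tokens that the large target model then verifies in a single forward pass, with an accept/reject rule that preserves the target's sampling distribution. Below is a minimal, self-contained sketch of that loop (per Leviathan et al., 2023 / Chen et al., 2023); the `toy_model`, `draft_probs`, and `target_probs` names are illustrative stand-ins for the two models, not this repository's actual API.

```python
# Minimal sketch of speculative sampling, NOT this repo's actual code.
# Two hypothetical language models are modeled as callables that map a
# token prefix to a next-token probability distribution over a toy vocab.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 16  # toy vocabulary size

def toy_model(seed):
    """Build a deterministic toy LM: prefix -> next-token distribution."""
    def probs(prefix):
        h = hash((seed, tuple(int(t) for t in prefix))) % (2**32)
        logits = np.random.default_rng(h).normal(size=VOCAB)
        e = np.exp(logits - logits.max())
        return e / e.sum()
    return probs

draft_probs = toy_model(seed=1)   # small, fast "draft" model
target_probs = toy_model(seed=2)  # large "target" model

def speculative_step(prefix, gamma=4):
    """One decode step: draft gamma tokens, then accept/reject via the target."""
    # 1) Draft model proposes gamma tokens autoregressively.
    drafts, qs = [], []
    ctx = list(prefix)
    for _ in range(gamma):
        q = draft_probs(ctx)
        x = int(rng.choice(VOCAB, p=q))
        drafts.append(x)
        qs.append(q)
        ctx.append(x)
    # 2) Target model scores every drafted position
    #    (in practice this is one batched forward pass).
    ps = [target_probs(list(prefix) + drafts[:i]) for i in range(gamma + 1)]
    # 3) Accept token x with prob min(1, p(x)/q(x)); on the first rejection,
    #    resample from the residual distribution max(0, p - q) and stop.
    out = list(prefix)
    for i, x in enumerate(drafts):
        if rng.random() < min(1.0, ps[i][x] / qs[i][x]):
            out.append(x)
        else:
            residual = np.maximum(ps[i] - qs[i], 0.0)
            out.append(int(rng.choice(VOCAB, p=residual / residual.sum())))
            return out
    # 4) All gamma tokens accepted: sample one bonus token from the target.
    out.append(int(rng.choice(VOCAB, p=ps[gamma])))
    return out

seq = [0]
for _ in range(5):
    seq = speculative_step(seq)
print(seq)
```

Each target pass emits between 1 and `gamma + 1` tokens; the speedup comes from amortizing one target-model pass over several draft tokens while the accept/reject rule keeps the output distribution identical to sampling from the target alone.
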
Alternatives and similar repositories for LLMSpeculativeSampling
Users interested in LLMSpeculativeSampling are comparing it to the libraries listed below.
- 📰 Must-read papers and blogs on Speculative Decoding ⚡️ · ★1,117 · Updated 2 weeks ago
- [NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. · ★503 · Updated last year
- Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings) · ★363 · Updated 9 months ago
- Analyze the inference of Large Language Models (LLMs). Analyze aspects like computation, storage, transmission, and hardware roofline mod… · ★615 · Updated last year
- LongBench v2 and LongBench (ACL '25 & '24) · ★1,081 · Updated last year
- ⛷️ LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training (EMNLP 2024) · ★1,005 · Updated last year
- 📰 Must-read papers on KV Cache Compression (constantly updating 🤗). · ★649 · Updated 4 months ago
- Ring attention implementation with flash attention · ★979 · Updated 4 months ago
- Best practice for training LLaMA models in Megatron-LM · ★664 · Updated 2 years ago
- The official repo of Pai-Megatron-Patch for LLM & VLM large scale training developed by Alibaba Cloud. · ★1,524 · Updated last month
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference · ★641 · Updated 3 weeks ago
- [TMLR 2024] Efficient Large Language Models: A Survey · ★1,253 · Updated 7 months ago
- ★353 · Updated last year
- [ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding · ★1,316 · Updated 11 months ago
- Train speculative decoding models effortlessly and port them smoothly to SGLang serving. · ★676 · Updated this week
- [EMNLP 2024 & AAAI 2026] A powerful toolkit for compressing large models including LLMs, VLMs, and video generative models. · ★675 · Updated 2 months ago
- A flexible and efficient training framework for large-scale alignment tasks · ★447 · Updated 3 months ago
- Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding** · ★214 · Updated 11 months ago
- [NeurIPS 2023] LLM-Pruner: On the Structural Pruning of Large Language Models. Support Llama-3/3.1, Llama-2, LLaMA, BLOOM, Vicuna, Baich… · ★1,105 · Updated last year
- Disaggregated serving system for Large Language Models (LLMs). · ★771 · Updated 10 months ago
- Awesome-LLM-KV-Cache: A curated list of Awesome LLM KV Cache Papers with Codes. · ★411 · Updated 11 months ago
- [ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning · ★639 · Updated last year
- [NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up long-context LLM inference, approximate and dynamic sparse calculation of the attention… · ★1,180 · Updated 4 months ago
- Awesome list for LLM pruning. · ★281 · Updated 3 months ago
- An Efficient "Factory" to Build Multiple LoRA Adapters · ★370 · Updated 11 months ago
- ★303 · Updated 6 months ago
- Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3 (NeurIPS'25). · ★2,169 · Updated last week
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… · ★810 · Updated 11 months ago
- Latency and Memory Analysis of Transformer Models for Training and Inference · ★478 · Updated 9 months ago
- Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models" · ★323 · Updated 11 months ago