feifeibear / LLMSpeculativeSampling
Fast inference from large language models via speculative decoding
⭐886 · Updated last year
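The core idea behind this repo, speculative decoding, can be sketched with the standard accept/reject rule from the speculative sampling papers: a cheap draft model proposes a token from distribution `q`, the target model's distribution `p` either accepts it with probability `min(1, p/q)` or resamples from the renormalized residual `max(0, p - q)`, which provably leaves the output distributed exactly as `p`. The toy below uses fixed NumPy distributions in place of real models; all names are illustrative, not the repo's API.

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(p, q, draft_token):
    """One accept/reject step of speculative sampling.

    p: target-model distribution over the vocabulary (1-D array summing to 1)
    q: draft-model distribution over the vocabulary (1-D array summing to 1)
    draft_token: token index that was sampled from q
    Returns (token, accepted).
    """
    # Accept the drafted token with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[draft_token] / q[draft_token]):
        return draft_token, True
    # On rejection, resample from the residual max(0, p - q), renormalized.
    # This correction keeps the overall output distribution exactly p.
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual)), False
```

Sampling the draft token from `q` and then applying `speculative_step` many times produces token frequencies matching `p`, even when `q` is a poor approximation; in practice the draft model proposes several tokens per target-model forward pass, and this rule is applied to each in turn.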
Alternatives and similar repositories for LLMSpeculativeSampling
Users that are interested in LLMSpeculativeSampling are comparing it to the libraries listed below
- 📰 Must-read papers and blogs on Speculative Decoding ⚡️ · ⭐1,117 · Updated 2 weeks ago
- [NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. · ⭐503 · Updated last year
- Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings) · ⭐363 · Updated 9 months ago
- LongBench v2 and LongBench (ACL 25'&24') · ⭐1,081 · Updated last year
- ⛷️ LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training (EMNLP 2024) · ⭐1,005 · Updated last year
- Analyze the inference of Large Language Models (LLMs). Analyze aspects like computation, storage, transmission, and hardware roofline mod… · ⭐617 · Updated last year
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference · ⭐641 · Updated 3 weeks ago
- Ring attention implementation with flash attention · ⭐979 · Updated 4 months ago
- ⭐353 · Updated last year
- 📰 Must-read papers on KV Cache Compression (constantly updating 🤗). · ⭐658 · Updated 4 months ago
- Best practice for training LLaMA models in Megatron-LM · ⭐664 · Updated 2 years ago
- [NeurIPS 2023] LLM-Pruner: On the Structural Pruning of Large Language Models. Support Llama-3/3.1, Llama-2, LLaMA, BLOOM, Vicuna, Baich… · ⭐1,105 · Updated last year
- The official repo of Pai-Megatron-Patch for LLM & VLM large-scale training, developed by Alibaba Cloud. · ⭐1,524 · Updated last month
- Train speculative decoding models effortlessly and port them smoothly to SGLang serving. · ⭐676 · Updated last week
- [TMLR 2024] Efficient Large Language Models: A Survey · ⭐1,253 · Updated 7 months ago
- [ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding · ⭐1,316 · Updated 11 months ago
- [ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning · ⭐639 · Updated last year
- Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding** · ⭐214 · Updated 11 months ago
- A flexible and efficient training framework for large-scale alignment tasks · ⭐447 · Updated 3 months ago
- [EMNLP 2024 & AAAI 2026] A powerful toolkit for compressing large models, including LLMs, VLMs, and video generative models. · ⭐675 · Updated 2 months ago
- Awesome list for LLM pruning. · ⭐282 · Updated 3 months ago
- Awesome-LLM-KV-Cache: A curated list of Awesome LLM KV Cache Papers with Codes. · ⭐411 · Updated 11 months ago
- Disaggregated serving system for Large Language Models (LLMs). · ⭐771 · Updated 10 months ago
- An Efficient "Factory" to Build Multiple LoRA Adapters · ⭐370 · Updated 11 months ago
- A repository sharing the literature on long-context large language models, including the methodologies and the evaluation benchmarks · ⭐272 · Updated last year
- Awesome LLM compression research papers and tools. · ⭐1,771 · Updated 2 months ago
- [NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up long-context LLMs' inference, approximate and dynamic sparse calculate the attention… · ⭐1,180 · Updated 4 months ago
- ⭐628 · Updated 3 weeks ago
- Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models" · ⭐322 · Updated 11 months ago
- Official Implementation of "Learning Harmonized Representations for Speculative Sampling" (HASS) · ⭐54 · Updated 10 months ago