feifeibear / LLMSpeculativeSamplingLinks

Fast inference from large lauguage models via speculative decoding

☆791

Alternatives and similar repositories for LLMSpeculativeSampling

Users that are interested in LLMSpeculativeSampling are comparing it to the libraries listed below

Sorting:

hemingkx / SpeculativeDecodingPapers
📰 Must-read papers and blogs on Speculative Decoding ⚡️
☆854Updated this week
hemingkx / Spec-Bench
Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings)
☆299Updated 3 months ago
FMInference / H2O
[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.
☆462Updated last year
SafeAILab / EAGLE
Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3.
☆1,439Updated last week
THUDM / LongBench
LongBench v2 and LongBench (ACL 25'&24')
☆936Updated 6 months ago
hahnyuan / LLM-Viewer
Analyze the inference of Large Language Models (LLMs). Analyze aspects like computation, storage, transmission, and hardware roofline mod…
☆521Updated 10 months ago
pjlab-sys4nlp / llama-moe
⛷️ LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training (EMNLP 2024)
☆977Updated 7 months ago
FMInference / DejaVu
☆331Updated last year
October2001 / Awesome-KV-Cache-Compression
📰 Must-read papers on KV Cache Compression (constantly updating 🤗).
☆500Updated this week
feifeibear / long-context-attention
USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference
☆537Updated 2 weeks ago
horseee / LLM-Pruner
[NeurIPS 2023] LLM-Pruner: On the Structural Pruning of Large Language Models. Support Llama-3/3.1, Llama-2, LLaMA, BLOOM, Vicuna, Baich…
☆1,050Updated 9 months ago
zhuzilin / ring-flash-attention
Ring attention implementation with flash attention
☆828Updated last week
AIoT-MLSys-Lab / Efficient-LLMs-Survey
[TMLR 2024] Efficient Large Language Models: A Survey
☆1,197Updated last month
alibaba / Megatron-LLaMA
Best practice for training LLaMA models in Megatron-LM
☆659Updated last year
hao-ai-lab / LookaheadDecoding
[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
☆1,263Updated 4 months ago
LLMServe / DistServe
Disaggregated serving system for Large Language Models (LLMs).
☆654Updated 3 months ago
princeton-nlp / LLM-Shearing
[ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
☆626Updated last year
pprp / Awesome-LLM-Prune
Awesome list for LLM pruning.
☆246Updated 7 months ago
THUDM / slime
slime is a LLM post-training framework aiming for RL Scaling.
☆975Updated this week
dilab-zju / self-speculative-decoding
Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding**
☆198Updated 5 months ago
mit-han-lab / omniserve
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se…
☆730Updated 4 months ago
microsoft / MInference
[NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up Long-context LLMs' inference, approximate and dynamic sparse calculate the attention…
☆1,080Updated last week
HArmonizedSS / HASS
Official Implementation of "Learning Harmonized Representations for Speculative Sampling" (HASS)
☆43Updated 4 months ago
alibaba / Pai-Megatron-Patch
The official repo of Pai-Megatron-Patch for LLM & VLM large scale training developed by Alibaba Cloud.
☆1,258Updated 3 weeks ago
horseee / Awesome-Efficient-LLM
A curated list for Efficient Large Language Models
☆1,802Updated last month
facebookresearch / LLM-QAT
Code repo for the paper "LLM-QAT Data-Free Quantization Aware Training for Large Language Models"
☆305Updated 5 months ago
Strivin0311 / long-llms-learning
A repository sharing the literatures about long-context large language models, including the methodologies and the evaluation benchmarks
☆265Updated last year
HuangOwen / Awesome-LLM-Compression
Awesome LLM compression research papers and tools.
☆1,618Updated last month
pprp / Awesome-LLM-Quantization
Awesome list for LLM quantization
☆260Updated last month
TUDB-Labs / mLoRA
An Efficient "Factory" to Build Multiple LoRA Adapters
☆330Updated 5 months ago