alipay/PainlessInferenceAcceleration

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/alipay/PainlessInferenceAcceleration)

alipay / PainlessInferenceAcceleration

Accelerate inference without tears

☆371

Alternatives and similar repositories for PainlessInferenceAcceleration

Users that are interested in PainlessInferenceAcceleration are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

hao-ai-lab / LookaheadDecoding
View on GitHub
[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
☆1,339Mar 6, 2025Updated last year
inclusionAI / flood
View on GitHub
☆15Feb 26, 2026Updated 5 months ago
FasterDecoding / REST
View on GitHub
REST: Retrieval-Based Speculative Decoding, NAACL 2024
☆220Mar 5, 2026Updated 4 months ago
hpcaitech / SwiftInfer
View on GitHub
Efficient AI Inference & Serving
☆478Jan 8, 2024Updated 2 years ago
FasterDecoding / Medusa
View on GitHub
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
☆2,761Jun 25, 2024Updated 2 years ago
End-to-end encrypted cloud storage - Proton Drive • Ad
Special offer: 40% Off Yearly / 80% Off First Month. Protect your most important files, photos, and documents from prying eyes.
thunlp / Ouroboros
View on GitHub
Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main)
☆117Mar 20, 2025Updated last year
apoorvumang / prompt-lookup-decoding
View on GitHub
Simple speculative decoding technique, integrated in vLLM and transformers
☆611Aug 23, 2024Updated last year
microsoft / MInference
View on GitHub
[NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up Long-context LLMs' inference, approximate and dynamic sparse calculate the attention…
☆1,224Apr 8, 2026Updated 3 months ago
inclusionAI / linghe
View on GitHub
A high-performance kernel library for LLM training
☆86Apr 28, 2026Updated 3 months ago
SafeAILab / EAGLE
View on GitHub
Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3 (NeurIPS'25).
☆2,481Feb 20, 2026Updated 5 months ago
inclusionAI / AWorld
View on GitHub
Search, understand, reproduce, and improve an idea with ease
☆1,215Updated this week
alibaba / rtp-llm
View on GitHub
RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
☆1,289Updated this week
ModelTC / LightLLM
View on GitHub
LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalabili…
☆4,198Updated this week
hemingkx / SpeculativeDecodingPapers
View on GitHub
📰 Must-read papers and blogs on Speculative Decoding ⚡️
☆1,283Jun 27, 2026Updated last month
Managed hosting for WordPress and PHP on Cloudways • Ad
Managed hosting for WordPress, Magento, Laravel, or PHP apps, on multiple cloud providers. Deploy in minutes on Cloudways by DigitalOcean.
efeslab / Nanoflow
View on GitHub
A throughput-oriented high-performance serving framework for LLMs
☆971Mar 29, 2026Updated 4 months ago
IST-DASLab / marlin
View on GitHub
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes of 16-32 tokens.
☆1,113Sep 4, 2024Updated last year
flexflow / flexflow-train
View on GitHub
Automatically Discovering Fast Parallelization Strategies for Distributed Deep Neural Network Training
☆1,898Updated this week
MayDomine / Burst-Attention
View on GitHub
Distributed IO-aware Attention algorithm
☆24Sep 24, 2025Updated 10 months ago
Tencent / KsanaLLM
View on GitHub
☆546Jul 14, 2026Updated 2 weeks ago
inferflow / inferflow
View on GitHub
Inferflow is an efficient and highly configurable inference engine for large language models (LLMs).
☆251Mar 15, 2024Updated 2 years ago
sgl-project / SpecForge
View on GitHub
Train speculative decoding models effortlessly and port them smoothly to SGLang serving.
☆1,027Updated this week
xlite-dev / Awesome-LLM-Inference
View on GitHub
📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉
☆5,425Updated this week
smart-lty / ParallelSpeculativeDecoding
View on GitHub
[ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length
☆170Dec 23, 2025Updated 7 months ago
End-to-end encrypted email - Proton Mail • Ad
Special offer: 40% Off Yearly / 80% Off First Month. All Proton services are open source and independently audited for security.
Jikai0Wang / OPT-Tree
View on GitHub
☆30May 24, 2025Updated last year
Infini-AI-Lab / MagicDec
View on GitHub
[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
☆156Dec 4, 2024Updated last year
hao-ai-lab / Consistency_LLM
View on GitHub
[ICML 2024] CLLMs: Consistency Large Language Models
☆415Nov 16, 2024Updated last year
NetEase-FuXi / EETQ
View on GitHub
Easy and Efficient Quantization for Transformers
☆205Mar 25, 2026Updated 4 months ago
Infini-AI-Lab / TriForce
View on GitHub
[COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
☆281Aug 31, 2024Updated last year
modelscope / dash-infer
View on GitHub
DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including …
☆273Aug 6, 2025Updated 11 months ago
S-LoRA / S-LoRA
View on GitHub
S-LoRA: Serving Thousands of Concurrent LoRA Adapters
☆1,920Jan 21, 2024Updated 2 years ago
feifeibear / Odysseus-Transformer
View on GitHub
Odysseus: Playground of LLM Sequence Parallelism
☆83Jun 17, 2024Updated 2 years ago
NVIDIA / TensorRT-LLM
View on GitHub
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizat…
☆14,245Updated this week
Deploy to Railway using AI coding agents - Free Credits Offer • Ad
Use Claude Code, Codex, OpenCode, and more. Autonomous software development now has the infrastructure to match with Railway.
microsoft / sarathi-serve
View on GitHub
A low-latency & high-throughput serving engine for LLMs
☆511Jan 8, 2026Updated 6 months ago
AutoGPTQ / AutoGPTQ
View on GitHub
An easy-to-use LLMs quantization package with user-friendly apis, based on GPTQ algorithm.
☆5,076Apr 11, 2025Updated last year
deepspeedai / DeepSpeed-MII
View on GitHub
MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.
☆2,108Jun 30, 2025Updated last year
OpenPPL / ppl.nn.llm
View on GitHub
☆140Apr 23, 2024Updated 2 years ago
kvcache-ai / Mooncake
View on GitHub
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
☆6,083Updated this week
linfeng93 / BiTA
View on GitHub
An innovative method expediting LLMs via streamlined semi-autoregressive generation and draft verification.
☆29Apr 15, 2025Updated last year
SkyworkAI / Skywork-MoE
View on GitHub
Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models
☆140Jun 12, 2024Updated 2 years ago