galeselee/Awesome_LLM_System-PaperList

Since the emergence of ChatGPT in 2022, accelerating large language models (LLMs) has become increasingly important. This is a list of papers on LLM acceleration. It currently focuses mainly on inference acceleration, and related work will be added gradually. Contributions are welcome!

☆283

Alternatives and similar repositories for Awesome_LLM_System-PaperList

Users that are interested in Awesome_LLM_System-PaperList are comparing it to the libraries listed below.

AmadeusChan / Awesome-LLM-System-Papers
☆645 · Jan 14, 2026 · Updated 6 months ago
microsoft / vattention
Dynamic Memory Management for Serving LLMs without PagedAttention
☆502 · Jun 10, 2026 · Updated last month
LLMServe / DistServe
Disaggregated serving system for Large Language Models (LLMs).
☆825 · Apr 6, 2025 · Updated last year
lambda7xx / awesome-AI-system
paper and its code for AI System
☆374 · May 14, 2026 · Updated 2 months ago
microsoft / sarathi-serve
A low-latency & high-throughput serving engine for LLMs
☆509 · Jan 8, 2026 · Updated 6 months ago
HPMLL / BurstGPT
A ChatGPT(GPT-3.5) & GPT-4 Workload Trace to Optimize LLM Serving Systems
☆277 · Jun 30, 2026 · Updated 2 weeks ago
AmberLJC / LLMSys-PaperList
Large Language Model (LLM) Systems Paper List
☆2,184 · Updated this week
llumnix-project / llumnix-ray
Efficient and easy multi-instance LLM serving
☆562 · Mar 12, 2026 · Updated 4 months ago
LoongServe / LoongServe
☆135 · Nov 11, 2024 · Updated last year
microsoft / ParrotServe
[OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable
☆222 · Sep 21, 2024 · Updated last year
microsoft / vidur
Accurate, large-scale, and extensible simulator for LLM inference Systems
☆641 · Jul 25, 2025 · Updated 11 months ago
chenhongyu2048 / LLM-inference-optimization-paper
Summary of some awesome work for optimizing LLM inference
☆259 · Feb 14, 2026 · Updated 5 months ago
Infini-AI-Lab / MagicDec
[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
☆154 · Dec 4, 2024 · Updated last year
October2001 / Awesome-KV-Cache-Compression
📰 Must-read papers on KV Cache Compression (constantly updating 🤗).
☆724 · Apr 15, 2026 · Updated 3 months ago
feifeibear / LLMRoofline
Compare different hardware platforms via the Roofline Model for LLM inference tasks.
☆123 · Mar 13, 2024 · Updated 2 years ago
byungsoo-oh / ml-systems-papers
Curated collection of papers in machine learning systems
☆630 · Feb 7, 2026 · Updated 5 months ago
xlite-dev / Awesome-LLM-Inference
📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉
☆5,392 · Jun 23, 2026 · Updated 3 weeks ago
feifeibear / ChituAttention
Quantized Attention on GPU
☆45 · Nov 22, 2024 · Updated last year
bytedance / flux
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
☆1,342 · Aug 28, 2025 · Updated 10 months ago
efeslab / Nanoflow
A throughput-oriented high-performance serving framework for LLMs
☆967 · Mar 29, 2026 · Updated 3 months ago
microsoft / mscclpp
MSCCL++: A GPU-driven communication stack for scalable AI applications
☆538 · Updated this week
interestingLSY / swiftLLM
A tiny yet powerful LLM inference system tailored for researching purpose. vLLM-equivalent performance with only 2k lines of code (2% of …
☆329 · Jun 10, 2025 · Updated last year
mutinifni / splitwise-sim
LLM serving cluster simulator
☆157 · Apr 25, 2024 · Updated 2 years ago
kvcache-ai / Mooncake
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
☆5,834 · Updated this week
DD-DuDa / Cute-Learning
Examples of CUDA implementations by Cutlass CuTe
☆279 · Jul 1, 2025 · Updated last year
infinigence / FlashOverlap
A lightweight design for computation-communication overlap.
☆242 · Jan 20, 2026 · Updated 5 months ago
xxyux / SpInfer
SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs
☆68 · Mar 25, 2025 · Updated last year
SNU-ARC / any-precision-llm
[ICML 2024 Oral] Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs
☆129 · Jul 4, 2025 · Updated last year
flashinfer-ai / flashinfer
FlashInfer: Kernel Library for LLM Serving
☆5,962 · Updated this week
sgl-project / DeepGEMM
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
☆32 · Updated this week
thustorage / Medusa
Medusa: Accelerating Serverless LLM Inference with Materialization [ASPLOS'25]
☆47 · May 13, 2025 · Updated last year
mit-han-lab / Quest
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
☆399 · Jul 10, 2025 · Updated last year
hahnyuan / LLM-Viewer
Analyze the inference of Large Language Models (LLMs). Analyze aspects like computation, storage, transmission, and hardware roofline mod…
☆661 · Sep 11, 2024 · Updated last year
horseee / Awesome-Efficient-LLM
A curated list for Efficient Large Language Models
☆2,023 · Jun 17, 2025 · Updated last year
KuangjuX / NVSHMEM-Tutorial
NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer
☆193 · Feb 11, 2026 · Updated 5 months ago
BBuf / how-to-optim-algorithm-in-cuda
how to optimize some algorithm in cuda.
☆3,139 · Jul 8, 2026 · Updated last week
HPMLL / NVIDIA-Hopper-Benchmark
☆113 · May 31, 2025 · Updated last year
jy-yuan / KIVI
[ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
☆416 · Nov 20, 2025 · Updated 7 months ago
antgroup / glake
GLake: optimizing GPU memory management and IO transmission.
☆501 · Mar 24, 2025 · Updated last year