TreeAI-Lab/Awesome-KV-Cache-Management

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/TreeAI-Lab/Awesome-KV-Cache-Management)

TreeAI-Lab / Awesome-KV-Cache-Management

This repository serves as a comprehensive survey of LLM development, featuring numerous research papers along with their corresponding code links.

☆339

Alternatives and similar repositories for Awesome-KV-Cache-Management

Users that are interested in Awesome-KV-Cache-Management are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

Zefan-Cai / Awesome-LLM-KV-Cache
View on GitHub
Awesome-LLM-KV-Cache: A curated list of 📙Awesome LLM KV Cache Papers with Codes.
☆460Jun 17, 2026Updated last month
October2001 / Awesome-KV-Cache-Compression
View on GitHub
📰 Must-read papers on KV Cache Compression (constantly updating 🤗).
☆725Apr 15, 2026Updated 3 months ago
NVIDIA / kvpress
View on GitHub
LLM KV cache compression made easy
☆1,142Jul 9, 2026Updated last week
zcli-charlie / Awesome-KV-Cache
View on GitHub
☆88Oct 9, 2024Updated last year
chenhongyu2048 / LLM-inference-optimization-paper
View on GitHub
Summary of some awesome work for optimizing LLM inference
☆261Feb 14, 2026Updated 5 months ago
Deploy to Railway using AI coding agents - Free Credits Offer • Ad
Use Claude Code, Codex, OpenCode, and more. Autonomous software development now has the infrastructure to match with Railway.
FasterDecoding / SnapKV
View on GitHub
☆324Jul 10, 2025Updated last year
HugoZHL / PQCache
View on GitHub
[SIGMOD 2025] PQCache: Product Quantization-based KVCache for Long Context LLM Inference
☆91Dec 7, 2025Updated 7 months ago
NEO-MLSys25 / NEO
View on GitHub
NEO is a LLM inference engine built to save the GPU memory crisis by CPU offloading
☆99Jun 16, 2025Updated last year
Graph-COM / GraphKV
View on GitHub
☆17Jun 30, 2025Updated last year
HarryWu99 / llm_kvcache_sparsity
View on GitHub
Implement some method of LLM KV Cache Sparsity
☆41Jun 6, 2024Updated 2 years ago
microsoft / RetrievalAttention
View on GitHub
[VLDB 26, NeurIPS 25] Scalable long-context LLM decoding that leverages sparsity—by treating the KV cache as a vector storage system.
☆147Feb 22, 2026Updated 4 months ago
xlite-dev / Awesome-LLM-Inference
View on GitHub
📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉
☆5,404Jun 23, 2026Updated 3 weeks ago
FYYFU / HeadKV
View on GitHub
[ICLR2025] Code and data for paper: Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasonin…
☆44Mar 10, 2025Updated last year
cmd2001 / KVTuner
View on GitHub
[ICML2025] KVTuner: Sensitivity-Aware Layer-wise Mixed Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference
☆29Jan 27, 2026Updated 5 months ago
Deploy on Railway without the complexity - Free Credits Offer • Ad
Connect your repo and Railway handles the rest with instant previews. Quickly provision container image services, databases, and storage volumes.
hemingkx / SpeculativeDecodingPapers
View on GitHub
📰 Must-read papers and blogs on Speculative Decoding ⚡️
☆1,276Jun 27, 2026Updated 3 weeks ago
antgroup / cakekv
View on GitHub
☆39Mar 17, 2025Updated last year
MoE-Inf / awesome-moe-inference
View on GitHub
Curated collection of papers in MoE model inference
☆408Mar 12, 2026Updated 4 months ago
alibaba-edu / qwen-bailian-usagetraces-anon
View on GitHub
☆150Apr 23, 2026Updated 2 months ago
ByteDance-Seed / ShadowKV
View on GitHub
[ICML 2025 Spotlight] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
☆310May 1, 2025Updated last year
UChi-JCL / CacheGen
View on GitHub
☆169Oct 9, 2024Updated last year
sjtu-zhao-lab / ClusterKV
View on GitHub
ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression (DAC'25)
☆32Feb 26, 2026Updated 4 months ago
microsoft / vattention
View on GitHub
Dynamic Memory Management for Serving LLMs without PagedAttention
☆504Updated this week
kvcache-ai / Mooncake
View on GitHub
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
☆5,925Updated this week
Virtual machines for every use case on DigitalOcean • Ad
Get dependable uptime with 99.99% SLA, simple security tools, and predictable monthly pricing with DigitalOcean's virtual machines, called Droplets.
MIRALab-USTC / LLM-AttentionPredictor
View on GitHub
The code for "AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference", Qingyue Yang, Jie Wang, Xing Li, Zhihai Wang, Ch…
☆29Jul 15, 2025Updated last year
snu-comparch / InfiniGen
View on GitHub
InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management (OSDI'24)
☆192Jul 10, 2024Updated 2 years ago
microsoft / sarathi-serve
View on GitHub
A low-latency & high-throughput serving engine for LLMs
☆511Jan 8, 2026Updated 6 months ago
horseee / Awesome-Efficient-LLM
View on GitHub
A curated list for Efficient Large Language Models
☆2,023Jun 17, 2025Updated last year
OpenBitSys / BitDecoding
View on GitHub
[HPCA 2026] A GPU-optimized system for efficient long-context LLMs decoding with low-bit KV cache.
☆96May 14, 2026Updated 2 months ago
chenyu-jiang / dcp
View on GitHub
Code repository for the SOSP'25 paper DCP: Addressing Input Dynamism In Long-Context Training via Dynamic Context Parallelism.
☆21Nov 28, 2025Updated 7 months ago
mit-han-lab / Quest
View on GitHub
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
☆400Jul 10, 2025Updated last year
jy-yuan / KIVI
View on GitHub
[ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
☆418Nov 20, 2025Updated 8 months ago
pku-liang / ArkVale
View on GitHub
ArkVale: Efficient Generative LLM Inference with Recallable Key-Value Eviction (NIPS'24)
☆54Dec 17, 2024Updated last year
Deploy to Railway using AI coding agents - Free Credits Offer • Ad
Use Claude Code, Codex, OpenCode, and more. Autonomous software development now has the infrastructure to match with Railway.
Geralt-Targaryen / Awesome-Speculative-Decoding
View on GitHub
Reading notes on Speculative Decoding papers
☆41Jun 2, 2026Updated last month
tsinghua-ideal / Twilight
View on GitHub
[NeurIPS'25 Spotlight] Adaptive Attention Sparsity with Hierarchical Top-p Pruning
☆105Jul 8, 2026Updated last week
NVlabs / RocketKV
View on GitHub
[ICML 2025] RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression
☆49Aug 7, 2025Updated 11 months ago
attention-survey / Efficient_Attention_Survey
View on GitHub
A Survey of Efficient Attention Methods: Hardware-efficient, Sparse, Compact, and Linear Attention
☆304Dec 1, 2025Updated 7 months ago
YaoJiayi / CacheBlend
View on GitHub
☆198Jul 15, 2025Updated last year
HuangOwen / Awesome-LLM-Compression
View on GitHub
Awesome LLM compression research papers and tools.
☆1,853Jun 30, 2026Updated 3 weeks ago
thu-nics / PM-KVQ
View on GitHub
The official code implementation for paper "PM-KVQ: Progressive Mixed-precision KV Cache Quantization for Long-CoT LLMs"
☆29May 24, 2025Updated last year