microsoft / RetrievalAttentionLinks

Scalable long-context LLM decoding that leverages sparsity—by treating the KV cache as a vector storage system.

☆94

Alternatives and similar repositories for RetrievalAttention

Users that are interested in RetrievalAttention are comparing it to the libraries listed below

Sorting:

HugoZHL / PQCache
[SIGMOD 2025] PQCache: Product Quantization-based KVCache for Long Context LLM Inference
☆73Updated this week
YaoJiayi / CacheBlend
☆141Updated 3 months ago
WukLab / preble
Stateful LLM Serving
☆87Updated 7 months ago
microsoft / ParrotServe
[OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable
☆188Updated last year
hao-ai-lab / MuxServe
☆74Updated last week
LoongServe / LoongServe
☆124Updated 11 months ago
tsinghua-ideal / Twilight
[NeurIPS'25 Spotlight] Adaptive Attention Sparsity with Hierarchical Top-p Pruning
☆51Updated this week
snu-comparch / InfiniGen
InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management (OSDI'24)
☆155Updated last year
hao-ai-lab / vllm-ltr
[NeurIPS 2024] Efficient LLM Scheduling by Learning to Rank
☆60Updated 11 months ago
NEO-MLSys25 / NEO
NEO is a LLM inference engine built to save the GPU memory crisis by CPU offloading
☆67Updated 4 months ago
LLMServe / SwiftTransformer
High performance Transformer implementation in C++.
☆138Updated 9 months ago
xinhao-luo / ClusterFusion
[NeurIPS 2025] ClusterFusion: Expanding Operator Fusion Scope for LLM Inference via Cluster-Level Collective Primitive
☆47Updated last month
microsoft / tokenweave
Efficient Compute-Communication Overlap for Distributed LLM Inference
☆61Updated 3 weeks ago
andy-yang-1 / DoubleSparse
16-fold memory access reduction with nearly no loss
☆105Updated 7 months ago
amazon-science / piperag
PipeRAG: Fast Retrieval-Augmented Generation via Algorithm-System Co-design (KDD 2025)
☆26Updated last year
Hsword / SpotServe
SpotServe: Serving Generative Large Language Models on Preemptible Instances
☆130Updated last year
pku-liang / ArkVale
ArkVale: Efficient Generative LLM Inference with Recallable Key-Value Eviction (NIPS'24)
☆43Updated 10 months ago
alibaba / ServeGen
A framework for generating realistic LLM serving workloads
☆71Updated 2 weeks ago
UChi-JCL / CacheGen
☆136Updated last year
DerrickYLJ / TidalDecode
[ICLR 2025] TidalDecode: A Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
☆48Updated 2 months ago
microsoft / chunk-attention
☆78Updated 6 months ago
ruipeterpan / marconi
Artifact for "Marconi: Prefix Caching for the Era of Hybrid LLMs" [MLSys '25 Outstanding Paper Award, Honorable Mention]
☆40Updated 7 months ago
alibaba-edu / qwen-bailian-usagetraces-anon
☆55Updated 4 months ago
zcli-charlie / Awesome-KV-Cache
☆80Updated last year
AlibabaResearch / flash-llm
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
☆221Updated 2 years ago
eddiegaoo / Apt-Serve
☆18Updated 4 months ago
KuangjuX / NVSHMEM-Tutorial
NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer
☆139Updated last month
microsoft / nnscaler
nnScaler: Compiling DNN models for Parallel Training
☆117Updated last month
infinigence / FlashOverlap
A lightweight design for computation-communication overlap.
☆181Updated 2 weeks ago
kvcache-ai / TrEnv-X
☆63Updated last month