sjtu-zhao-lab / ClusterKVLinks

ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression (DAC'25)

☆16

Alternatives and similar repositories for ClusterKV

Users that are interested in ClusterKV are comparing it to the libraries listed below

Sorting:

pku-liang / ArkVale
ArkVale: Efficient Generative LLM Inference with Recallable Key-Value Eviction (NIPS'24)
☆44Updated 11 months ago
xinhao-luo / ClusterFusion
[NeurIPS 2025] ClusterFusion: Expanding Operator Fusion Scope for LLM Inference via Cluster-Level Collective Primitive
☆49Updated last month
snu-comparch / InfiniGen
InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management (OSDI'24)
☆161Updated last year
LoongServe / LoongServe
☆124Updated last year
DD-DuDa / BitDecoding
[HPCA 2026] A GPU-optimized system for efficient long-context LLMs decoding with low-bit KV cache.
☆63Updated this week
tsinghua-ideal / Twilight
[NeurIPS'25 Spotlight] Adaptive Attention Sparsity with Hierarchical Top-p Pruning
☆67Updated last week
xxyux / SpInfer
SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs
☆59Updated 7 months ago
ranggihwang / Pregated_MoE
☆57Updated last year
YaoJiayi / CacheBlend
☆151Updated 4 months ago
NEO-MLSys25 / NEO
NEO is a LLM inference engine built to save the GPU memory crisis by CPU offloading
☆69Updated 5 months ago
d-matrix-ai / keyformer-llm
☆58Updated last year
gty111 / gLLM
gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling
☆49Updated this week
TreeAI-Lab / Awesome-KV-Cache-Management
This repository serves as a comprehensive survey of LLM development, featuring numerous research papers along with their corresponding co…
☆242Updated 3 months ago
microsoft / RetrievalAttention
Scalable long-context LLM decoding that leverages sparsity—by treating the KV cache as a vector storage system.
☆99Updated 2 months ago
HugoZHL / PQCache
[SIGMOD 2025] PQCache: Product Quantization-based KVCache for Long Context LLM Inference
☆76Updated 2 weeks ago
infinigence / FlashOverlap
A lightweight design for computation-communication overlap.
☆187Updated last month
PKU-SEC-Lab / HybriMoE
[DAC'25] Official implement of "HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference"
☆89Updated 5 months ago
hao-ai-lab / MuxServe
☆79Updated last month
KuangjuX / NVSHMEM-Tutorial
NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer
☆143Updated 2 months ago
EfficientLLMSys / MuxServe
☆14Updated last year
AlibabaResearch / flash-llm
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
☆223Updated 2 years ago
chenhongyu2048 / LLM-inference-optimization-paper
Summary of some awesome work for optimizing LLM inference
☆138Updated 3 weeks ago
tile-ai / tilescale
Tile-based language built for AI computation across all scales
☆80Updated last week
SJTU-ReArch-Group / Paper-Reading-List
☆137Updated this week
interestingLSY / swiftLLM
A tiny yet powerful LLM inference system tailored for researching purpose. vLLM-equivalent performance with only 2k lines of code (2% of …
☆288Updated 5 months ago
LLMServe / SwiftTransformer
High performance Transformer implementation in C++.
☆142Updated 10 months ago
zcli-charlie / Awesome-KV-Cache
☆82Updated last year
FFY0 / AdaKV
The Official Implementation of Ada-KV [NeurIPS 2025]
☆113Updated last month
nex-agi / NexVenusCL
Nex Venus Communication Library
☆50Updated this week
microsoft / SparTA
☆159Updated last year