ByteDance-Seed/ShadowKV

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/ByteDance-Seed/ShadowKV)

ByteDance-Seed / ShadowKV

[ICML 2025 Spotlight] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference

☆310

Alternatives and similar repositories for ShadowKV

Users that are interested in ShadowKV are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

mit-han-lab / Quest
View on GitHub
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
☆400Jul 10, 2025Updated last year
Infini-AI-Lab / TriForce
View on GitHub
[COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
☆281Aug 31, 2024Updated last year
Infini-AI-Lab / MagicPIG
View on GitHub
[ICLR2025 Spotlight] MagicPIG: LSH Sampling for Efficient LLM Generation
☆255Dec 16, 2024Updated last year
FFY0 / AdaKV
View on GitHub
The Official Implementation of Ada-KV [NeurIPS 2025]
☆139Nov 26, 2025Updated 7 months ago
Infini-AI-Lab / vortex_torch
View on GitHub
Vortex: Programmable Sparse Attention for Agents as Algorithm Designers
☆68Jun 24, 2026Updated 3 weeks ago
Deploy to Railway using AI coding agents - Free Credits Offer • Ad
Use Claude Code, Codex, OpenCode, and more. Autonomous software development now has the infrastructure to match with Railway.
feifeibear / ChituAttention
View on GitHub
Quantized Attention on GPU
☆45Nov 22, 2024Updated last year
IsaacRe / vllm-kvcompress
View on GitHub
KV cache compression for high-throughput LLM inference
☆158Feb 5, 2025Updated last year
ByteDance-Seed / FlexPrefill
View on GitHub
Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
☆170Oct 13, 2025Updated 9 months ago
microsoft / RetrievalAttention
View on GitHub
[VLDB 26, NeurIPS 25] Scalable long-context LLM decoding that leverages sparsity—by treating the KV cache as a vector storage system.
☆147Feb 22, 2026Updated 4 months ago
flashinfer-ai / cutlass-viz
View on GitHub
☆65Apr 26, 2025Updated last year
mit-han-lab / duo-attention
View on GitHub
[ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
☆539Feb 10, 2025Updated last year
microsoft / MInference
View on GitHub
[NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up Long-context LLMs' inference, approximate and dynamic sparse calculate the attention…
☆1,221Apr 8, 2026Updated 3 months ago
bytedance / ABQ-LLM
View on GitHub
An acceleration library that supports arbitrary bit-width combinatorial quantization operations
☆246Sep 30, 2024Updated last year
jy-yuan / KIVI
View on GitHub
[ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
☆418Nov 20, 2025Updated 8 months ago
Proton VPN Special Offer - Get 70% off • Ad
Special partner offer. Trusted by over 100 million users worldwide. Tested, Approved and Recommended by Experts.
sail-sg / SimLayerKV
View on GitHub
The official implementation of paper: SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction.
☆54Oct 18, 2024Updated last year
DerrickYLJ / TidalDecode
View on GitHub
[ICLR 2025] TidalDecode: A Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
☆56Aug 6, 2025Updated 11 months ago
SqueezeAILab / SqueezedAttention
View on GitHub
[ACL 2025] Squeezed Attention: Accelerating Long Prompt LLM Inference
☆58Nov 20, 2024Updated last year
microsoft / ParrotServe
View on GitHub
[OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable
☆222Sep 21, 2024Updated last year
mit-han-lab / omniserve
View on GitHub
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se…
☆850Mar 6, 2025Updated last year
snu-comparch / InfiniGen
View on GitHub
InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management (OSDI'24)
☆192Jul 10, 2024Updated 2 years ago
October2001 / Awesome-KV-Cache-Compression
View on GitHub
📰 Must-read papers on KV Cache Compression (constantly updating 🤗).
☆725Apr 15, 2026Updated 3 months ago
efeslab / Nanoflow
View on GitHub
A throughput-oriented high-performance serving framework for LLMs
☆968Mar 29, 2026Updated 3 months ago
Zefan-Cai / KVCache-Factory
View on GitHub
Unified KV Cache Compression Methods for Auto-Regressive Models
☆1,352Jul 10, 2026Updated last week
GPU virtual machines on DigitalOcean Gradient AI • Ad
Get to production fast with high-performance AMD and NVIDIA GPUs you can spin up in seconds. The definition of operational simplicity.
shadowpa0327 / Palu
View on GitHub
[ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection
☆158Feb 20, 2025Updated last year
ByteDance-Seed / cudaLLM
View on GitHub
☆148Aug 18, 2025Updated 11 months ago
NVIDIA / kvpress
View on GitHub
LLM KV cache compression made easy
☆1,142Jul 9, 2026Updated last week
opengear-project / GEAR
View on GitHub
GEAR: An Efficient KV Cache Compression Recipefor Near-Lossless Generative Inference of LLM
☆183Jul 12, 2024Updated 2 years ago
microsoft / sarathi-serve
View on GitHub
A low-latency & high-throughput serving engine for LLMs
☆511Jan 8, 2026Updated 6 months ago
andy-yang-1 / DoubleSparse
View on GitHub
16-fold memory access reduction with nearly no loss
☆107Mar 26, 2025Updated last year
FasterDecoding / SnapKV
View on GitHub
☆324Jul 10, 2025Updated last year
Zanette-Labs / SpeculativeRejection
View on GitHub
[NeurIPS 2024] Fast Best-of-N Decoding via Speculative Rejection
☆56Oct 29, 2024Updated last year
perplexityai / pplx-kernels
View on GitHub
Perplexity GPU Kernels
☆591Nov 7, 2025Updated 8 months ago
End-to-end encrypted email - Proton Mail • Ad
Special offer: 40% Off Yearly / 80% Off First Month. All Proton services are open source and independently audited for security.
bytedance / flux
View on GitHub
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
☆1,344Aug 28, 2025Updated 10 months ago
mit-han-lab / flash-moba
View on GitHub
☆250Nov 19, 2025Updated 8 months ago
ByteDance-Seed / Triton-distributed
View on GitHub
Distributed Compiler based on Triton for Parallel Systems
☆1,494Updated this week
Qcompiler / MixQ_Tensorrt_LLM
View on GitHub
Mixed precision inference by Tensorrt-LLM
☆79Oct 23, 2024Updated last year
microsoft / vattention
View on GitHub
Dynamic Memory Management for Serving LLMs without PagedAttention
☆504Updated this week
MingXiangL / AttentionShift
View on GitHub
Official Implementation of AttentionShift: Iteratively Estimated Part-based Attention Map for Pointly Supervised Instance Segmentation
☆155Oct 18, 2024Updated last year
FibonaccciYan / Adamas
View on GitHub
Adamas: Hadamard Sparse Attention for Efficient Long-context Inference
☆15May 19, 2026Updated 2 months ago