antgroup/cakekv

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/antgroup/cakekv)

antgroup / cakekv

☆39

Alternatives and similar repositories for cakekv

Users that are interested in cakekv are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

FFY0 / AdaKV
View on GitHub
The Official Implementation of Ada-KV [NeurIPS 2025]
☆139Nov 26, 2025Updated 8 months ago
Ketonmi / Awesome-Large-Scale-LLM-Serving
View on GitHub
Must-read papers on improving efficiency for LLM serving clusters
☆34May 28, 2025Updated last year
FasterDecoding / SnapKV
View on GitHub
☆326Jul 10, 2025Updated last year
FFY0 / DefensiveKV
View on GitHub
Official Implementation for [ICLR26] DefensiveKV: Taming the Fragility of KV Cache Eviction in LLM Inference
☆56Mar 28, 2026Updated 4 months ago
AIoT-MLSys-Lab / D2O
View on GitHub
[ICLR 2025🔥] D2O: Dynamic Discriminative Operations for Efficient Long-Context Inference of Large Language Models
☆27Jul 7, 2025Updated last year
AI Agents on DigitalOcean Gradient AI Platform • Ad
Build production-ready AI agents using customizable tools or access multiple LLMs through a single endpoint. Create custom knowledge bases or connect external data.
October2001 / Awesome-KV-Cache-Compression
View on GitHub
📰 Must-read papers on KV Cache Compression (constantly updating 🤗).
☆728Apr 15, 2026Updated 3 months ago
HugoZHL / PQCache
View on GitHub
[SIGMOD 2025] PQCache: Product Quantization-based KVCache for Long Context LLM Inference
☆91Dec 7, 2025Updated 7 months ago
Linking-ai / SCOPE
View on GitHub
(ACL2025 oral) SCOPE: Optimizing KV Cache Compression in Long-context Generation
☆36May 28, 2025Updated last year
apple / ml-epicache
View on GitHub
☆30Oct 2, 2025Updated 9 months ago
snu-mllab / KVzip
View on GitHub
[NeurIPS'25 Oral] Query-agnostic KV cache eviction: 3–4× reduction in memory and 2× decrease in latency (Qwen3/2.5, Gemma3, LLaMA3)
☆225Feb 11, 2026Updated 5 months ago
mutonix / pyramidinfer
View on GitHub
☆47Nov 25, 2024Updated last year
tsinghua-ideal / Twilight
View on GitHub
[NeurIPS'25 Spotlight] Adaptive Attention Sparsity with Hierarchical Top-p Pruning
☆105Jul 8, 2026Updated 3 weeks ago
JingyangYi / ShorterBetter
View on GitHub
☆18Jul 31, 2025Updated last year
Infini-AI-Lab / MagicPIG
View on GitHub
[ICLR2025 Spotlight] MagicPIG: LSH Sampling for Efficient LLM Generation
☆255Dec 16, 2024Updated last year
Managed Database hosting by DigitalOcean • Ad
PostgreSQL, MySQL, MongoDB, Kafka, Valkey, and OpenSearch available. Automatically scale up storage and focus on building your apps.
sjtu-zhao-lab / ClusterKV
View on GitHub
ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression (DAC'25)
☆32Feb 26, 2026Updated 5 months ago
Scientific-Computing-Lab / STREAMer
View on GitHub
STREAMer: Benchmarking remote volatile and non-volatile memory bandwidth
☆18Aug 21, 2023Updated 2 years ago
ByteDance-Seed / FlexPrefill
View on GitHub
Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
☆170Oct 13, 2025Updated 9 months ago
antgroup / OmniKV
View on GitHub
Dynamic Context Selection for Efficient Long-Context LLMs
☆63May 20, 2025Updated last year
DerrickYLJ / TidalDecode
View on GitHub
[ICLR 2025] TidalDecode: A Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
☆57Aug 6, 2025Updated 11 months ago
mit-han-lab / Quest
View on GitHub
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
☆402Jul 10, 2025Updated last year
DerekHJH / epic
View on GitHub
☆21Jul 13, 2026Updated 2 weeks ago
Osilly / dynamic_llava
View on GitHub
[ICLR 2025] The official pytorch implement of "Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Cont…
☆72Sep 18, 2025Updated 10 months ago
pzs19 / TokenSelect
View on GitHub
☆20Mar 11, 2025Updated last year
Managed Database hosting by DigitalOcean • Ad
PostgreSQL, MySQL, MongoDB, Kafka, Valkey, and OpenSearch available. Automatically scale up storage and focus on building your apps.
microsoft / MInference
View on GitHub
[NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up Long-context LLMs' inference, approximate and dynamic sparse calculate the attention…
☆1,224Apr 8, 2026Updated 3 months ago
NonvolatileMemory / GliDe_with_a_CaPE_ICML_24
View on GitHub
official code for GliDe with a CaPE
☆22Aug 13, 2024Updated last year
thu-nics / PM-KVQ
View on GitHub
The official code implementation for paper "PM-KVQ: Progressive Mixed-precision KV Cache Quantization for Long-CoT LLMs"
☆29May 24, 2025Updated last year
jy-yuan / KIVI
View on GitHub
[ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
☆423Nov 20, 2025Updated 8 months ago
snu-mllab / Context-Memory
View on GitHub
Pytorch implementation for "Compressed Context Memory For Online Language Model Interaction" (ICLR'24)
☆63Apr 18, 2024Updated 2 years ago
horseee / CoT-Valve
View on GitHub
CoT-Valve: Length-Compressible Chain-of-Thought Tuning
☆91Feb 14, 2025Updated last year
stonet-research / cheops25-IO-characterization-of-LLM-model-kv-cache-offloading-nvme
View on GitHub
☆19Apr 15, 2025Updated last year
mit-han-lab / Block-Sparse-Attention
View on GitHub
A sparse attention kernel supporting mix sparse patterns
☆540Jan 18, 2026Updated 6 months ago
ByteDance-Seed / ShadowKV
View on GitHub
[ICML 2025 Spotlight] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
☆311May 1, 2025Updated last year
Deploy on Railway without the complexity - Free Credits Offer • Ad
Connect your repo and Railway handles the rest with instant previews. Quickly provision container image services, databases, and storage volumes.
ZBox1005 / CoT-UQ
View on GitHub
[ACL 2025] "CoT-UQ: Improving Response-wise Uncertainty Quantification in LLMs with Chain-of-Thought"
☆17Apr 3, 2025Updated last year
alessiodevoto / l2compress
View on GitHub
Code for the EMNLP24 paper "A simple and effective L2 norm based method for KV Cache compression."
☆19Dec 13, 2024Updated last year
nightdessert / Retrieval_Head
View on GitHub
open-source code for paper: Retrieval Head Mechanistically Explains Long-Context Factuality
☆241Aug 2, 2024Updated last year
Zefan-Cai / KVCache-Factory
View on GitHub
Unified KV Cache Compression Methods for Auto-Regressive Models
☆1,355Jul 10, 2026Updated 3 weeks ago
TreeAI-Lab / Awesome-KV-Cache-Management
View on GitHub
This repository serves as a comprehensive survey of LLM development, featuring numerous research papers along with their corresponding co…
☆343Jul 16, 2026Updated 2 weeks ago
SuDIS-ZJU / Efficient-LVLMs-Inference
View on GitHub
[ACL 2026 Findings] Living repository for the survey paper “Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques…
☆26Apr 8, 2026Updated 3 months ago
PanZaifeng / KVFlow
View on GitHub
☆28Mar 12, 2026Updated 4 months ago