kvpress: LLM KV cache compression made easy (★971, updated Mar 13, 2026)
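For orientation, here is a minimal usage sketch of kvpress itself, based on the project README; the pipeline name, press class, and `compression_ratio` argument follow the README at the time of writing and may differ across versions.

```python
# Usage sketch adapted from the kvpress README; names may have changed
# in current releases, so treat this as illustrative, not definitive.
from transformers import pipeline
from kvpress import ExpectedAttentionPress

context = "..."   # a long document whose KV cache should be compressed
question = "..."  # a query answered against the compressed cache

pipe = pipeline(
    "kv-press-text-generation",  # custom pipeline registered by kvpress
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    device="cuda",
)
press = ExpectedAttentionPress(compression_ratio=0.5)  # drop ~50% of KV pairs
answer = pipe(context, question=question, press=press)["answer"]
```

Compression methods are packaged as interchangeable "press" objects, so comparing methods amounts to swapping the press passed to the pipeline.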
Alternatives and similar repositories for kvpress
Users interested in kvpress are comparing it to the libraries listed below.
- 📰 Must-read papers on KV Cache Compression (constantly updating 🤗) (★674, updated Feb 24, 2026)
- ★306, updated Jul 10, 2025
- [ICML 2024] KIVI: A Tuning-Free Asymmetric 2-bit Quantization for KV Cache (★359, updated Nov 20, 2025; a quantization sketch in this spirit appears after the list)
- [NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up long-context LLM inference, computes attention approximately with dynamic sparsity… (★1,198, updated Mar 9, 2026)
- The Official Implementation of Ada-KV [NeurIPS 2025] (★128, updated Nov 26, 2025)
- [ICML 2025 Spotlight] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference (★286, updated May 1, 2025)
- [ICLR 2025 Spotlight] MagicPIG: LSH Sampling for Efficient LLM Generation (★251, updated Dec 16, 2024)
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… (★818, updated Mar 6, 2025)
- FlashInfer: Kernel Library for LLM Serving (★5,145, updated Mar 15, 2026)
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads (★531, updated Feb 10, 2025)
- [NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models (★506, updated Aug 1, 2024; an eviction sketch in this spirit appears after the list)
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference (★377, updated Jul 10, 2025)
- KV cache compression for high-throughput LLM inference (★153, updated Feb 5, 2025)
- ★47, updated Nov 25, 2024
- [COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding (★277, updated Aug 31, 2024)
- Code for the EMNLP 2024 paper "A simple and effective L2 norm-based method for KV cache compression" (★18, updated Dec 13, 2024; a sketch of the norm heuristic appears after the list)
- Awesome-LLM-KV-Cache: A curated list of 📚 Awesome LLM KV Cache Papers with Codes (★418, updated Mar 3, 2025)
- Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of… (★149, updated Aug 9, 2024)
- [ICML 2025] XAttention: Block Sparse Attention with Antidiagonal Scoring (★273, updated Jul 6, 2025)
- Code for the [ICLR 2025 Oral] paper FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference (★164, updated Oct 13, 2025)
- Efficient LLM Inference over Long Sequences (★393, updated Jun 25, 2025)
- FP16xINT4 LLM inference kernel achieving near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens (★1,041, updated Sep 4, 2024)
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs (★389, updated Apr 13, 2025)
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU Clusters (★133, updated Dec 3, 2024)
- Supercharge Your LLM with the Fastest KV Cache Layer (★7,693, updated this week)
- Helpful tools and examples for working with flex-attention (★1,157, updated Feb 8, 2026)
- Unified KV Cache Compression Methods for Auto-Regressive Models (★1,311, updated Jan 4, 2025)
- 🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention" (★977, updated Feb 5, 2026)
- Distributed Compiler based on Triton for Parallel Systems (★1,386, updated Mar 11, 2026)
- [ICLR 2025 🔥] D2O: Dynamic Discriminative Operations for Efficient Long-Context Inference of Large Language Models (★27, updated Jul 7, 2025)
- Official Implementation for [ICLR 2026] DefensiveKV: Taming the Fragility of KV Cache Eviction in LLM Inference (★30, updated Mar 15, 2026)
- The official implementation of the paper "SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction" (★51, updated Oct 18, 2024)
- ★65, updated Apr 26, 2025
- Fast low-bit matmul kernels in Triton (★438, updated Feb 1, 2026)
- (ACL 2025 Oral) SCOPE: Optimizing KV Cache Compression in Long-context Generation (★34, updated May 28, 2025)
- Awesome LLM compression research papers and tools (★1,789, updated Feb 23, 2026)
- Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI (★4,953, updated this week)
- A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit floating point (FP8 and FP4) precision on H… (★3,231, updated this week)
- 🚀 Efficient implementations of state-of-the-art linear attention models (★4,630, updated this week)
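Three minimal sketches follow to make the recurring techniques above concrete. First, asymmetric low-bit KV quantization in the spirit of the KIVI entry: keys are quantized per-channel (min/max taken over tokens) and values per-token (min/max taken over channels). This is an illustrative sketch under assumed shapes and helper names, not KIVI's implementation or API.

```python
# Sketch of KIVI-style asymmetric low-bit KV quantization; bit width,
# shapes, and function names are illustrative assumptions.
import torch

def quantize(x: torch.Tensor, dim: int, bits: int = 2):
    """Asymmetric uniform quantization with min/max taken along `dim`."""
    qmax = 2**bits - 1
    xmin = x.amin(dim=dim, keepdim=True)
    scale = (x.amax(dim=dim, keepdim=True) - xmin).clamp(min=1e-8) / qmax
    q = ((x - xmin) / scale).round().clamp(0, qmax).to(torch.uint8)
    return q, scale, xmin

def dequantize(q, scale, xmin):
    return q.float() * scale + xmin

k = torch.randn(1024, 128)        # [tokens, head_dim]
v = torch.randn(1024, 128)
k_packed = quantize(k, dim=0)     # keys: per-channel (reduce over tokens)
v_packed = quantize(v, dim=1)     # values: per-token (reduce over channels)
print((dequantize(*k_packed) - k).abs().mean())  # mean reconstruction error
```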
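Second, heavy-hitter eviction in the spirit of the H2O entry: keep the cache positions that have accumulated the most attention mass, plus a window of recent tokens. The even budget split between heavy hitters and recent tokens, and the tensor shapes, are assumptions for illustration, not H2O's actual implementation.

```python
# Sketch of heavy-hitter KV eviction; split and shapes are assumptions.
import torch

def heavy_hitter_keep_indices(attn_weights: torch.Tensor, budget: int) -> torch.Tensor:
    """attn_weights: [num_heads, q_len, kv_len] softmax scores.
    Returns the sorted kv positions to keep for this layer."""
    kv_len = attn_weights.shape[-1]
    if budget >= kv_len:
        return torch.arange(kv_len)
    recent = budget // 2                    # assumed split: half recent tokens,
    heavy = budget - recent                 # half accumulated-attention heavy hitters
    mass = attn_weights.sum(dim=(0, 1))     # attention mass each position received
    mass[kv_len - recent:] = float("-inf")  # recent tokens are kept unconditionally
    heavy_idx = mass.topk(heavy).indices
    recent_idx = torch.arange(kv_len - recent, kv_len)
    return torch.cat([heavy_idx, recent_idx]).sort().values

attn = torch.softmax(torch.randn(8, 1, 1024), dim=-1)  # dummy per-head scores
keep = heavy_hitter_keep_indices(attn, budget=256)
# k_cache, v_cache = k_cache[:, keep], v_cache[:, keep]
```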
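Third, the L2-norm heuristic from the EMNLP 2024 entry: the paper observes that keys with a low L2 norm tend to attract high attention, so the lowest-norm keys are retained and no attention scores are needed at all. The function name and shapes below are assumptions.

```python
# Sketch of L2-norm-based KV eviction; names and shapes are assumptions.
import torch

def l2_keep_indices(k_cache: torch.Tensor, budget: int) -> torch.Tensor:
    """k_cache: [kv_len, head_dim]; keep the `budget` lowest-norm keys."""
    norms = k_cache.norm(dim=-1)  # one score per cached key, no attention pass
    budget = min(budget, norms.numel())
    return norms.topk(budget, largest=False).indices.sort().values

k_cache = torch.randn(1024, 128)
keep = l2_keep_indices(k_cache, budget=256)
# k_cache, v_cache = k_cache[keep], v_cache[keep]
```

The appeal of this heuristic is that, unlike heavy-hitter scoring, it requires no attention weights, so it composes with fused attention kernels that never materialize them.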