LLM KV cache compression made easy
☆1,120 · Jun 22, 2026 · Updated last week
Alternatives and similar repositories for kvpress
Users that are interested in kvpress are comparing it to the libraries listed below.
- Must-read papers on KV Cache Compression (constantly updating). ☆720 · Apr 15, 2026 · Updated 2 months ago
- ☆321 · Jul 10, 2025 · Updated 11 months ago
- [ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache ☆414 · Nov 20, 2025 · Updated 7 months ago
- [NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up Long-context LLMs' inference, approximate and dynamic sparse calculate the attention… ☆1,220 · Apr 8, 2026 · Updated 2 months ago
- [ICML 2025 Spotlight] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference ☆308 · May 1, 2025 · Updated last year
- The Official Implementation of Ada-KV [NeurIPS 2025] ☆136 · Nov 26, 2025 · Updated 7 months ago
- [ICLR2025 Spotlight] MagicPIG: LSH Sampling for Efficient LLM Generation ☆254 · Dec 16, 2024 · Updated last year
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ☆845 · Mar 6, 2025 · Updated last year
- FlashInfer: Kernel Library for LLM Serving ☆5,867 · Updated this week
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads ☆539 · Feb 10, 2025 · Updated last year
- KV cache compression for high-throughput LLM inference ☆158 · Feb 5, 2025 · Updated last year
- [NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. ☆523 · Aug 1, 2024 · Updated last year
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference ☆397 · Jul 10, 2025 · Updated 11 months ago
- Awesome-LLM-KV-Cache: A curated list of Awesome LLM KV Cache Papers with Codes. ☆451 · Jun 17, 2026 · Updated last week
- ☆47 · Nov 25, 2024 · Updated last year
- [COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding ☆281 · Aug 31, 2024 · Updated last year
- Code for the EMNLP24 paper "A simple and effective L2 norm based method for KV Cache compression." ☆18 · Dec 13, 2024 · Updated last year
- Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of… ☆151 · Aug 9, 2024 · Updated last year
- [ICML 2025] XAttention: Block Sparse Attention with Antidiagonal Scoring ☆280 · Jul 6, 2025 · Updated 11 months ago
- Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference ☆169 · Oct 13, 2025 · Updated 8 months ago
- Efficient LLM Inference over Long Sequences ☆394 · Jun 25, 2025 · Updated last year
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes of 16-32 tokens. ☆1,095 · Sep 4, 2024 · Updated last year
- GPU-accelerated algorithm for subsampling datasets while preserving diversity ☆27 · Jan 12, 2024 · Updated 2 years ago
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆389 · Apr 13, 2025 · Updated last year
- Unified KV Cache Compression Methods for Auto-Regressive Models ☆1,349 · Jun 23, 2026 · Updated last week
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ☆134 · Dec 3, 2024 · Updated last year
- LMCache: Supercharge Your LLM with the Fastest KV Cache Layer ☆9,944 · Updated this week
- Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention" ☆1,006 · Feb 5, 2026 · Updated 4 months ago
- Helpful tools and examples for working with flex-attention ☆1,205 · Updated this week
- Fast low-bit matmul kernels in Triton ☆475 · May 15, 2026 · Updated last month
- Distributed Compiler based on Triton for Parallel Systems ☆1,466 · Updated this week
- [ICLR 2025] D2O: Dynamic Discriminative Operations for Efficient Long-Context Inference of Large Language Models ☆27 · Jul 7, 2025 · Updated 11 months ago
- The official implementation of paper: SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction. ☆51 · Oct 18, 2024 · Updated last year
- ☆65 · Apr 26, 2025 · Updated last year
- A throughput-oriented high-performance serving framework for LLMs ☆962 · Mar 29, 2026 · Updated 3 months ago
- (ACL2025 oral) SCOPE: Optimizing KV Cache Compression in Long-context Generation ☆36 · May 28, 2025 · Updated last year
- A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit floating point (FP8 and FP4) precision on H… ☆3,408 · Updated this week
- Efficient implementations for emerging model architectures ☆5,249 · Jun 23, 2026 · Updated last week
- Domain-specific language designed to streamline the development of high-performance GPU/CPU/Accelerators kernels ☆6,552 · Updated this week