jy-yuan/KIVI

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/jy-yuan/KIVI)

jy-yuan / KIVI

[ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

☆423

Alternatives and similar repositories for KIVI

Users that are interested in KIVI are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

SqueezeAILab / KVQuant
View on GitHub
[NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
☆431Aug 13, 2024Updated last year
opengear-project / GEAR
View on GitHub
GEAR: An Efficient KV Cache Compression Recipefor Near-Lossless Generative Inference of LLM
☆184Jul 12, 2024Updated 2 years ago
mit-han-lab / Quest
View on GitHub
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
☆402Jul 10, 2025Updated last year
efeslab / Atom
View on GitHub
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
☆344Jul 2, 2024Updated 2 years ago
FMInference / H2O
View on GitHub
[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.
☆530Aug 1, 2024Updated last year
Open source password manager - Proton Pass • Ad
Securely store, share, and autofill your credentials with Proton Pass, the end-to-end encrypted password manager trusted by millions.
FasterDecoding / SnapKV
View on GitHub
☆327Jul 10, 2025Updated last year
mit-han-lab / omniserve
View on GitHub
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se…
☆852Mar 6, 2025Updated last year
henryzhongsc / longctx_bench
View on GitHub
KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches. EMNLP Findings 2024
☆90Feb 27, 2025Updated last year
microsoft / MInference
View on GitHub
[NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up Long-context LLMs' inference, approximate and dynamic sparse calculate the attention…
☆1,223Apr 8, 2026Updated 3 months ago
cat538 / SKVQ
View on GitHub
[COLM 2024] SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models
☆24Oct 5, 2024Updated last year
spcl / QuaRot
View on GitHub
Code for Neurips24 paper: QuaRot, an end-to-end 4-bit inference of large language models.
☆524Nov 26, 2024Updated last year
shadowpa0327 / Palu
View on GitHub
[ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection
☆159Feb 20, 2025Updated last year
Infini-AI-Lab / TriForce
View on GitHub
[COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
☆281Aug 31, 2024Updated last year
OpenGVLab / OmniQuant
View on GitHub
[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
☆903Nov 26, 2025Updated 8 months ago
Managed Kubernetes at scale on DigitalOcean • Ad
DigitalOcean Kubernetes includes the control plane, bandwidth allowance, container registry, automatic updates, and more for free.
IST-DASLab / marlin
View on GitHub
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes of 16-32 tokens.
☆1,111Sep 4, 2024Updated last year
mit-han-lab / duo-attention
View on GitHub
[ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
☆539Feb 10, 2025Updated last year
SNU-ARC / any-precision-llm
View on GitHub
[ICML 2024 Oral] Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs
☆130Jul 4, 2025Updated last year
SqueezeAILab / SqueezeLLM
View on GitHub
[ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization
☆722Aug 13, 2024Updated last year
Infini-AI-Lab / MagicDec
View on GitHub
[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
☆156Dec 4, 2024Updated last year
Infini-AI-Lab / MagicPIG
View on GitHub
[ICLR2025 Spotlight] MagicPIG: LSH Sampling for Efficient LLM Generation
☆255Dec 16, 2024Updated last year
IST-DASLab / Sparse-Marlin
View on GitHub
Boosting 4-bit inference kernels with 2:4 Sparsity
☆96Sep 4, 2024Updated last year
nbasyl / LLM-FP4
View on GitHub
The official implementation of the EMNLP 2023 paper LLM-FP4
☆226Dec 15, 2023Updated 2 years ago
THUDM / LongBench
View on GitHub
LongBench v2 and LongBench (ACL 25'&24')
☆1,215Jan 15, 2025Updated last year
AI Agents on DigitalOcean Gradient AI Platform • Ad
Build production-ready AI agents using customizable tools or access multiple LLMs through a single endpoint. Create custom knowledge bases or connect external data.
ClubieDong / QAQ-KVCacheQuantization
View on GitHub
QAQ: Quality Adaptive Quantization for LLM KV Cache
☆55Mar 27, 2024Updated 2 years ago
Aaronhuang-778 / BiLLM
View on GitHub
[ICML 2024] BiLLM: Pushing the Limit of Post-Training Quantization for LLMs
☆235Jan 11, 2025Updated last year
aiha-lab / MX-QLLM
View on GitHub
LLM Inference with Microscaling Format
☆35Nov 12, 2024Updated last year
andy-yang-1 / DoubleSparse
View on GitHub
16-fold memory access reduction with nearly no loss
☆107Mar 26, 2025Updated last year
FFY0 / AdaKV
View on GitHub
The Official Implementation of Ada-KV [NeurIPS 2025]
☆139Nov 26, 2025Updated 8 months ago
mit-han-lab / smoothquant
View on GitHub
[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
☆1,674Jul 12, 2024Updated 2 years ago
October2001 / Awesome-KV-Cache-Compression
View on GitHub
📰 Must-read papers on KV Cache Compression (constantly updating 🤗).
☆727Apr 15, 2026Updated 3 months ago
usyd-fsalab / fp6_llm
View on GitHub
An efficient GPU support for LLM inference with x-bit quantization (e.g. FP6,FP5).
☆282Jul 16, 2025Updated last year
hdong920 / LESS
View on GitHub
☆53May 13, 2024Updated 2 years ago
Deploy on Railway without the complexity - Free Credits Offer • Ad
Connect your repo and Railway handles the rest with instant previews. Quickly provision container image services, databases, and storage volumes.
aladinggit / RDMI
View on GitHub
This is the repo for remote direct memory introspection.
☆25Jun 21, 2023Updated 3 years ago
NVIDIA / kvpress
View on GitHub
LLM KV cache compression made easy
☆1,147Updated this week
OpenBitSys / BitDecoding
View on GitHub
[HPCA 2026] A GPU-optimized system for efficient long-context LLMs decoding with low-bit KV cache.
☆96May 14, 2026Updated 2 months ago
OpenBitSys / BitDistiller
View on GitHub
[ACL 2024] A novel QAT with Self-Distillation framework to enhance ultra low-bit LLMs.
☆139May 16, 2024Updated 2 years ago
IsaacRe / vllm-kvcompress
View on GitHub
KV cache compression for high-throughput LLM inference
☆158Feb 5, 2025Updated last year
ChenMnZ / PrefixQuant
View on GitHub
An algorithm for weight-activation quantization (W4A4, W4A8) of LLMs, supporting both static and dynamic quantization
☆176Nov 26, 2025Updated 8 months ago
ByteDance-Seed / ShadowKV
View on GitHub
[ICML 2025 Spotlight] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
☆311May 1, 2025Updated last year