snu-mllab / KVzipLinks

[NeurIPS'25 Oral] Query-agnostic KV cache eviction: 3–4× reduction in memory and 2× decrease in latency (Qwen3/2.5, Gemma3, LLaMA3)

☆125

Alternatives and similar repositories for KVzip

Users that are interested in KVzip are comparing it to the libraries listed below

Sorting:

wdlctc / headinfer
☆58Updated 5 months ago
SalesforceAIResearch / GemFilter
☆85Updated 9 months ago
DeepAuto-AI / hip-attention
Training-free Post-training Efficient Sub-quadratic Complexity Attention. Implemented with OpenAI Triton.
☆147Updated last week
IST-DASLab / QuEST
Work in progress.
☆74Updated 3 months ago
ZihanWang314 / CoE
Chain of Experts (CoE) enables communication between experts within Mixture-of-Experts (MoE) models
☆220Updated last month
Cornell-RelaxML / yaqa-quantization
☆60Updated 4 months ago
hao-ai-lab / Dynasor
[NeurIPS 2025] Simple extension on vLLM to help you speed up reasoning model without training.
☆197Updated 4 months ago
IsaacRe / vllm-kvcompress
KV cache compression for high-throughput LLM inference
☆142Updated 8 months ago
whyNLP / LCKV
Layer-Condensed KV cache w/ 10 times larger batch size, fewer params and less computation. Dramatic speed up with better task performance…
☆155Updated 6 months ago
woct0rdho / transformers-qwen3-moe-fused
Fused Qwen3 MoE layer for faster training, compatible with HF Transformers, LoRA, 4-bit quant, Unsloth
☆197Updated this week
VITA-Group / WeLore
From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients. Ajay Jaiswal, Lu Yin, Zhenyu Zhang, Shiwei Liu,…
☆51Updated 6 months ago
NathanGodey / qfilters
Repository for the Q-Filters method (https://arxiv.org/pdf/2503.02812)
☆35Updated 7 months ago
Multiverse4FM / Multiverse
☆80Updated 4 months ago
gabrielolympie / moe-pruner
A repository aimed at pruning DeepSeek V3, R1 and R1-zero to a usable size
☆74Updated last month
BaohaoLiao / RSD
[ICML 2025] Reward-guided Speculative Decoding (RSD) for efficiency and effectiveness.
☆49Updated 5 months ago
eqimp / hogwild_llm
Official PyTorch implementation for Hogwild! Inference: Parallel LLM Generation with a Concurrent Attention Cache
☆127Updated 2 months ago
Tencent / llm.hunyuan.T1
☆84Updated 6 months ago
Cornell-RelaxML / qtip
☆152Updated 4 months ago
bigai-nlco / TokenSwift
[ICML 2025] |TokenSwift: Lossless Acceleration of Ultra Long Sequence Generation
☆115Updated 5 months ago
OpenGVLab / EfficientQAT
[ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models
☆308Updated 5 months ago
xichen-fy / Fira
Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint?
☆115Updated last year
tiiuae / onebitllms
Lightweight toolkit package to train and fine-tune 1.58bit Language models
☆92Updated 5 months ago
s-sahoo / Eso-LMs
Esoteric Language Models
☆101Updated 2 weeks ago
Infini-AI-Lab / Multiverse
☆98Updated last month
BlinkDL / modded-nanogpt-rwkv
RWKV-7: Surpassing GPT
☆98Updated 11 months ago
wuhy68 / Parameter-Efficient-MoE
Parameter-Efficient Sparsity Crafting From Dense to Mixture-of-Experts for Instruction Tuning on General Tasks (EMNLP'24)
☆147Updated last year
Infini-AI-Lab / MagicDec
[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
☆130Updated 10 months ago
hetailang / SqueezeAttention
☆38Updated last year
QwenLM / ParScale
Parallel Scaling Law for Language Model — Beyond Parameter and Inference Time Scaling
☆448Updated 5 months ago
itsnamgyu / block-transformer
Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024)
☆162Updated 6 months ago