FYYFU/HeadKV

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/FYYFU/HeadKV)

FYYFU / HeadKV

[ICLR2025] Code and data for paper: Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning

☆42

Alternatives and similar repositories for HeadKV

Users that are interested in HeadKV are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

FFY0 / AdaKV
View on GitHub
The Official Implementation of Ada-KV [NeurIPS 2025]
☆131Nov 26, 2025Updated 5 months ago
antgroup / cakekv
View on GitHub
☆38Mar 17, 2025Updated last year
FranxYao / Retrieval-Head-with-Flash-Attention
View on GitHub
Efficient retrieval head analysis with triton flash attention that supports topK probability
☆13Jun 15, 2024Updated last year
pangjh3 / AnLLM
View on GitHub
☆20Jun 17, 2024Updated last year
nightdessert / Retrieval_Head
View on GitHub
open-source code for paper: Retrieval Head Mechanistically Explains Long-Context Factuality
☆239Aug 2, 2024Updated last year
Deploy to Railway using AI coding agents - Free Credits Offer • Ad
Use Claude Code, Codex, OpenCode, and more. Autonomous software development now has the infrastructure to match with Railway.
MileBench / MileBench
View on GitHub
This repo contains evaluation code for the paper "MileBench: Benchmarking MLLMs in Long Context"
☆37Jul 11, 2024Updated last year
ByteDance-Seed / FlexPrefill
View on GitHub
Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
☆168Oct 13, 2025Updated 7 months ago
SqueezeAILab / MultipoleAttention
View on GitHub
[NeurIPS 2025] Multipole Attention for Efficient Long Context Reasoning
☆23Dec 5, 2025Updated 5 months ago
Hua-XC / MSCMNet
View on GitHub
person-reid MSCMNet
☆17Nov 8, 2024Updated last year
UCSB-NLP-Chang / KVLink
View on GitHub
☆43Oct 16, 2025Updated 7 months ago
eth-lre / LLM_ICL
View on GitHub
ACL24
☆11Jun 7, 2024Updated last year
tsinghua-ideal / Twilight
View on GitHub
[NeurIPS'25 Spotlight] Adaptive Attention Sparsity with Hierarchical Top-p Pruning
☆100Apr 20, 2026Updated last month
JingyangYi / ShorterBetter
View on GitHub
☆16Jul 31, 2025Updated 9 months ago
ignoww / ZOODiP
View on GitHub
[CVPR 2025] Efficient Personalization of Quantized Diffusion Model without Backpropagation
☆16Mar 31, 2025Updated last year
Deploy to Railway using AI coding agents - Free Credits Offer • Ad
Use Claude Code, Codex, OpenCode, and more. Autonomous software development now has the infrastructure to match with Railway.
real-absolute-AI / RAPID
View on GitHub
[ICML 2025 Spotlight] RAPID: Long-Context Inference with Retrieval-Augmented Speculative Decoding
☆23Mar 2, 2025Updated last year
shoaibahmed / llm_depth_pruning
View on GitHub
Official implementation of the paper: "A deeper look at depth pruning of LLMs"
☆15Jul 24, 2024Updated last year
shadowpa0327 / Palu
View on GitHub
[ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection
☆157Feb 20, 2025Updated last year
mlvlab / ST-VLM
View on GitHub
☆13Mar 28, 2025Updated last year
krafton-ai / lexico
View on GitHub
KV cache compression via sparse coding
☆17Oct 26, 2025Updated 6 months ago
MatthewKKai / MaLP
View on GitHub
Implementation Code for "LLM-based Medical Assistant Personalization with Short- and Long-Term Memory Coordination"
☆14Updated this week
mit-han-lab / duo-attention
View on GitHub
[ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
☆541Feb 10, 2025Updated last year
hdong920 / LESS
View on GitHub
☆53May 13, 2024Updated 2 years ago
G-JWLee / TAMP
View on GitHub
☆13May 15, 2025Updated last year
AI Agents on DigitalOcean Gradient AI Platform • Ad
Build production-ready AI agents using customizable tools or access multiple LLMs through a single endpoint. Create custom knowledge bases or connect external data.
snu-comparch / Tender
View on GitHub
Tender: Accelerating Large Language Models via Tensor Decompostion and Runtime Requantization (ISCA'24)
☆32Jul 4, 2024Updated last year
AnswerDotAI / cold-compress
View on GitHub
Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of…
☆151Aug 9, 2024Updated last year
SqueezeAILab / SqueezedAttention
View on GitHub
[ACL 2025] Squeezed Attention: Accelerating Long Prompt LLM Inference
☆60Nov 20, 2024Updated last year
lyan62 / vlm-info-loss
View on GitHub
☆22Sep 16, 2025Updated 8 months ago
varshakishore / IncDSI
View on GitHub
☆11Sep 10, 2023Updated 2 years ago
tian1327 / SWAT
View on GitHub
[CVPR 2025] Few-shot Recognition via Stage-Wise Retrieval-Augmented Finetuning
☆30Mar 15, 2026Updated 2 months ago
Small-Model-Gap / Small-Model-Learnability-Gap
View on GitHub
☆22Oct 10, 2025Updated 7 months ago
huangyuxiang03 / Locret
View on GitHub
☆14Oct 3, 2024Updated last year
kimyuji / EvolvingQA_benchmark
View on GitHub
Code and Dataset release of "Carpe Diem: On the Evaluation of World Knowledge in Lifelong Language Models" (NAACL 2024)
☆10Oct 16, 2024Updated last year
Managed hosting for WordPress and PHP on Cloudways • Ad
Managed hosting for WordPress, Magento, Laravel, or PHP apps, on multiple cloud providers. Deploy in minutes on Cloudways by DigitalOcean.
JIA-Lab-research / Q-LLM
View on GitHub
This is the official repo of "QuickLLaMA: Query-aware Inference Acceleration for Large Language Models"
☆54Jul 16, 2024Updated last year
Lux0926 / ASPRM
View on GitHub
AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence
☆10Mar 2, 2025Updated last year
dongwonjo / FastKV
View on GitHub
[ACL Findings 2026] Official Implementation of "FastKV: Decoupling of Context Reduction and KV Cache Compression for Prefill-Decoding Acc…
☆32Apr 14, 2026Updated last month
WindyLee0822 / CTG
View on GitHub
Source code of “Reinforcement Learning with Token-level Feedback for Controllable Text Generation (NAACL 2024)
☆17Dec 8, 2024Updated last year
mit-han-lab / flash-moba
View on GitHub
☆248Nov 19, 2025Updated 6 months ago
raymin0223 / fast_robust_early_exit
View on GitHub
Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding (EMNLP 2023 Long)
☆65Sep 28, 2024Updated last year
passing2961 / DialogCC
View on GitHub
Official code and dataset for our NAACL 2024 paper: DialogCC: An Automated Pipeline for Creating High-Quality Multi-modal Dialogue Datase…
☆13Jun 24, 2024Updated last year