LLMkvsys / rethink-kv-compressionLinks

☆20

Alternatives and similar repositories for rethink-kv-compression

Users that are interested in rethink-kv-compression are comparing it to the libraries listed below

Sorting:

DerrickYLJ / TidalDecode
[ICLR 2025] TidalDecode: A Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
☆49Updated 3 months ago
sail-sg / SimLayerKV
The official implementation of paper: SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction.
☆50Updated last year
SqueezeAILab / SqueezedAttention
[ACL 2025] Squeezed Attention: Accelerating Long Prompt LLM Inference
☆54Updated last year
ASISys / AdaSkip
AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference
☆19Updated 9 months ago
zyxxmu / cam
Pytorch implementation of our paper accepted by ICML 2024 -- CaM: Cache Merging for Memory-efficient LLMs Inference
☆47Updated last year
hyx1999 / SAM-Decoding
Official Implementation of SAM-Decoding: Speculative Decoding via Suffix Automaton
☆36Updated 9 months ago
pzs19 / TokenSelect
☆15Updated 8 months ago
Jingyu6 / speculative_prefill
☆46Updated 6 months ago
andy-yang-1 / DoubleSparse
16-fold memory access reduction with nearly no loss
☆107Updated 7 months ago
Infini-AI-Lab / APE
☆34Updated 9 months ago
uservan / speculative_thinking
☆29Updated last month
ruipeterpan / specreason
PoC for "SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning" [NeurIPS '25]
☆58Updated last month
Adaxry / Unified_Layer_Skipping
☆14Updated last year
Dominic789654 / LongGenBench
Source code for the paper "LongGenBench: Long-context Generation Benchmark"
☆24Updated last year
FFY0 / AdaKV
The Official Implementation of Ada-KV [NeurIPS 2025]
☆113Updated last month
metacarbon / shareAtt
Beyond KV Caching: Shared Attention for Efficient LLMs
☆20Updated last year
tsinghua-ideal / Twilight
[NeurIPS'25 Spotlight] Adaptive Attention Sparsity with Hierarchical Top-p Pruning
☆67Updated last week
machilusZ / FastGen
This repo contains the source code for: Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs
☆41Updated last year
SNU-ARC / any-precision-llm
[ICML 2024 Oral] Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs
☆120Updated 4 months ago
ByteDance-Seed / FlexPrefill
Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
☆154Updated last month
OpenSparseLLMs / Linearization
☆61Updated 4 months ago
dongwonjo / FastKV
Official Implementation of FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation
☆26Updated 6 months ago
LiuXiaoxuanPKU / OSD
☆60Updated 11 months ago
mit-han-lab / fastrl
Efficient Reinforcement Learning for Language Models
☆43Updated this week
sail-sg / LongSpec
LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification
☆67Updated 4 months ago
RLsys-Foundation / APRIL
APRIL: Active Partial Rollouts in Reinforcement Learning to Tame Long-tail Generation. A system-level optimization for scalable LLM tra…
☆43Updated last month
BaiTheBest / SparseLLM
Official Repo for SparseLLM: Global Pruning of LLMs (NeurIPS 2024)
☆67Updated 7 months ago
aiha-lab / MX-QLLM
LLM Inference with Microscaling Format
☆32Updated last year
thu-nics / MoA
[CoLM'25] The official implementation of the paper <MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression>
☆150Updated 4 months ago
zhuzilin / flash-attention-with-sink
☆39Updated 3 months ago