ThisisBillhe / ZipCache

[NeurIPS 2024] The official implementation of ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification

☆29

Alternatives and similar repositories for ZipCache

Users who are interested in ZipCache are comparing it to the repositories listed below.

Aaronhuang-778 / Mixture-Compressor-MoE
[ICLR 2025] Mixture Compressor for Mixture-of-Experts LLMs Gains More
☆58 Updated 8 months ago
maomaocun / dLLM-cache
Official PyTorch implementation of the paper "dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching" (dLLM-Cache…
☆171 Updated last month
mit-han-lab / x-attention
[ICML 2025] XAttention: Block Sparse Attention with Antidiagonal Scoring
☆239 Updated 3 months ago
Hsu1023 / DuQuant
[NeurIPS 2024 Oral🔥] DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs.
☆171 Updated last year
thu-nics / MBQ
The code repository of "MBQ: Modality-Balanced Quantization for Large Vision-Language Models"
☆64 Updated 7 months ago
thu-nics / ViDiT-Q
[ICLR'25] ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation
☆126 Updated 7 months ago
mit-han-lab / Block-Sparse-Attention
A sparse attention kernel supporting mixed sparse patterns
☆342 Updated 8 months ago
thu-nics / DiTFastAttn
☆182 Updated 9 months ago
Juanerx / Q-DiT
[CVPR 2025] Q-DiT: Accurate Post-Training Quantization for Diffusion Transformers
☆67 Updated last year
z-lab / sparselora
[ICML 2025] SparseLoRA: Accelerating LLM Fine-Tuning with Contextual Sparsity
☆60 Updated 3 months ago
thu-nics / MoA
[CoLM'25] The official implementation of the paper "MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression"
☆147 Updated 3 months ago
attention-survey / Efficient_Attention_Survey
A Survey of Efficient Attention Methods: Hardware-efficient, Sparse, Compact, and Linear Attention
☆207 Updated 2 months ago
mdy666 / Qwen-Native-Sparse-Attention
qwen-nsa
☆79 Updated 2 weeks ago
ThisisBillhe / ZipAR
[ICML 2025] This is the official PyTorch implementation of "ZipAR: Accelerating Auto-regressive Image Generation through Spatial Locality…
☆53 Updated 7 months ago
ByteDance-Seed / FlexPrefill
Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
☆147 Updated 2 weeks ago
SUSTechBruce / LOOK-M
[EMNLP 2024 Findings🔥] Official implementation of "LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context In…
☆103 Updated 11 months ago
LINs-lab / DynMoE
[ICLR 2025] Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models
☆137 Updated 3 months ago
ModelTC / TFMQ-DM
[CVPR 2024 Highlight & TPAMI 2025] This is the official PyTorch implementation of "TFMQ-DM: Temporal Feature Maintenance Quantization for…
☆106 Updated 3 weeks ago
ChangyuanWang17 / QVLM
[NeurIPS'24] Efficient and accurate memory-saving method for W4A4 large multi-modal models.
☆85 Updated 9 months ago
lliai / D2MoE
D^2-MoE: Delta Decompression for MoE-based LLMs Compression
☆69 Updated 7 months ago
horseee / dKV-Cache
[NeurIPS'25] dKV-Cache: The Cache for Diffusion Language Models
☆110 Updated 5 months ago
thu-nics / FrameFusion
[ICCV'25] The official code of paper "Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models"
☆63 Updated 3 weeks ago
FFY0 / AdaKV
The Official Implementation of Ada-KV [NeurIPS 2025]
☆106 Updated last month
NVlabs / COAT
[ICLR 2025] COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training
☆242 Updated 2 months ago
andy-yang-1 / DoubleSparse
16-fold memory access reduction with nearly no loss
☆105 Updated 7 months ago
A-suozhang / ViDiT-Q
☆15 Updated 7 months ago
JieShibo / MoLE
[ICML 2025 Oral] Mixture of Lookup Experts
☆53 Updated 5 months ago
pprp / Awesome-Efficient-MoE
Efficient Mixture of Experts for LLM Paper List
☆140 Updated last month
NVlabs / Fast-dLLM
Official implementation of "Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding"
☆604 Updated this week
thu-ml / SLA
SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse–Linear Attention
☆113 Updated this week