ThisisBillhe / ZipCache
[NeurIPS 2024] The official implementation of ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification
⭐32 · Updated 10 months ago
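For orientation, the sketch below illustrates the general idea named in the repository description: identify salient tokens in the KV cache and keep them at higher precision while quantizing the remaining tokens to low bit-width. It is a minimal, hypothetical example, not the official ZipCache implementation; the saliency metric (mean attention received per token), the 4-bit setting, and all function names are assumptions made for illustration.

```python
# Illustrative sketch of salient-token-aware KV cache quantization.
# NOT the official ZipCache implementation; the saliency metric and the
# full-precision/4-bit split are assumptions for demonstration only.
import torch

def quantize_per_token(x, n_bits=4):
    """Uniform asymmetric fake-quantization per token (last dim = head_dim)."""
    qmax = 2 ** n_bits - 1
    x_min = x.amin(dim=-1, keepdim=True)
    x_max = x.amax(dim=-1, keepdim=True)
    scale = (x_max - x_min).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round((x - x_min) / scale), 0, qmax)
    return q * scale + x_min  # dequantized values

def compress_kv(keys, values, attn_probs, salient_ratio=0.1, n_bits=4):
    """
    keys, values: [num_tokens, head_dim]
    attn_probs:   [num_queries, num_tokens] softmax attention weights
    Tokens receiving the most attention on average are kept in full
    precision; all other tokens are quantized to `n_bits`.
    """
    num_tokens = keys.shape[0]
    num_salient = max(1, int(salient_ratio * num_tokens))
    saliency = attn_probs.mean(dim=0)                 # [num_tokens]
    salient_idx = saliency.topk(num_salient).indices  # most-attended tokens

    k_out = quantize_per_token(keys, n_bits)
    v_out = quantize_per_token(values, n_bits)
    k_out[salient_idx] = keys[salient_idx]            # keep salient tokens exact
    v_out[salient_idx] = values[salient_idx]
    return k_out, v_out

if __name__ == "__main__":
    T, D = 128, 64
    k, v = torch.randn(T, D), torch.randn(T, D)
    attn = torch.softmax(torch.randn(16, T), dim=-1)
    k_q, v_q = compress_kv(k, v, attn)
    print("key reconstruction error:", (k_q - k).abs().mean().item())
```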
Alternatives and similar repositories for ZipCache
Users interested in ZipCache are comparing it with the libraries listed below.
- [ICLR 2025] Mixture Compressor for Mixture-of-Experts LLMs Gains More ⭐66 · Updated 11 months ago
- [NeurIPS 2024 Oral🔥] DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs. ⭐180 · Updated last year
- Official PyTorch implementation of the paper "dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching" (dLLM-Cache… ⭐197 · Updated 2 months ago
- The code repository of "MBQ: Modality-Balanced Quantization for Large Vision-Language Models" ⭐75 · Updated 10 months ago
- [ICML 2025] XAttention: Block Sparse Attention with Antidiagonal Scoring ⭐269 · Updated 7 months ago
- The Official Implementation of Ada-KV [NeurIPS 2025] ⭐126 · Updated 2 months ago
- A Survey of Efficient Attention Methods: Hardware-efficient, Sparse, Compact, and Linear Attention ⭐278 · Updated 2 months ago
- [CVPR 2025] Q-DiT: Accurate Post-Training Quantization for Diffusion Transformers ⭐74 · Updated last year
- PyTorch implementation of our paper accepted by ICML 2024 -- CaM: Cache Merging for Memory-efficient LLMs Inference ⭐48 · Updated last year
- D^2-MoE: Delta Decompression for MoE-based LLMs Compression ⭐72 · Updated 10 months ago
- [ICLR'25] ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation ⭐149 · Updated 10 months ago
- [ICML 2025] SparseLoRA: Accelerating LLM Fine-Tuning with Contextual Sparsity ⭐70 · Updated 7 months ago
- This repo contains the source code for: Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs ⭐44 · Updated last year
- Efficient Mixture of Experts for LLM Paper List ⭐166 · Updated 4 months ago
- ⭐221 · Updated 2 months ago
- [COLM 2025] Official PyTorch implementation of "Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models" ⭐67 · Updated 7 months ago
- 16-fold memory access reduction with nearly no loss ⭐110 · Updated 10 months ago
- [CoLM'25] The official implementation of the paper <MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression> ⭐155 · Updated 3 weeks ago
- Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference ⭐160 · Updated 3 months ago
- FastCache: Fast Caching for Diffusion Transformer Through Learnable Linear Approximation [Efficient ML Model] ⭐46 · Updated last week
- SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs ⭐192 · Updated 4 months ago
- [ICML 2025] SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models ⭐51 · Updated last year
- [NeurIPS'25] dKV-Cache: The Cache for Diffusion Language Models ⭐129 · Updated 8 months ago
- ⭐15 · Updated 10 months ago
- Official PyTorch implementation of the paper "Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Princ… ⭐40 · Updated 6 months ago
- ⭐190 · Updated last year
- qwen-nsa ⭐87 · Updated 3 months ago
- [NeurIPS'25] The official code implementation for paper "R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Tok… ⭐76 · Updated this week
- [ICML 2024 Oral] This project is the official implementation of our Accurate LoRA-Finetuning Quantization of LLMs via Information Retenti… ⭐67 · Updated last year
- Activation-aware Singular Value Decomposition for Compressing Large Language Models ⭐88 · Updated last year