[NeurIPS 2024] The official implementation of ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification
☆31Mar 30, 2025Updated last year
Alternatives and similar repositories for ZipCache
Users that are interested in ZipCache are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.
Sorting:
- [ICML 2025] This is the official PyTorch implementation of "ZipAR: Accelerating Auto-regressive Image Generation through Spatial Locality…☆53Mar 25, 2025Updated last year
- [ICCV 2025] The official implementation of "Neighboring Autoregressive Modeling for Efficient Visual Generation"☆60Apr 5, 2025Updated last year
- torch_quantizer is a out-of-box quantization tool for PyTorch models on CUDA backend, specially optimized for Diffusion Models.☆25Mar 29, 2024Updated 2 years ago
- [ACL 2026 Findings] CoV: Chain-of-View Prompting for Spatial Reasoning☆52Updated this week
- ☆16Sep 12, 2023Updated 2 years ago
- DigitalOcean Gradient AI Platform • AdBuild production-ready AI agents using customizable tools or access multiple LLMs through a single endpoint. Create custom knowledge bases or connect external data.
- [COLM 2024] SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models☆24Oct 5, 2024Updated last year
- ☆12Sep 7, 2024Updated last year
- [NAACL 2025🔥] MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference☆19Jun 19, 2025Updated 9 months ago
- ☆52May 13, 2024Updated last year
- The official implementation of PTQD: Accurate Post-Training Quantization for Diffusion Models☆103Mar 12, 2024Updated 2 years ago
- The Official Implementation of Ada-KV [NeurIPS 2025]☆131Nov 26, 2025Updated 4 months ago
- This is the official PyTorch implementation of "BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation."☆40Oct 9, 2025Updated 6 months ago
- Tender: Accelerating Large Language Models via Tensor Decompostion and Runtime Requantization (ISCA'24)☆31Jul 4, 2024Updated last year
- [ICLR 2024] This is the official PyTorch implementation of "QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Mod…☆39Mar 11, 2024Updated 2 years ago
- Wordpress hosting with auto-scaling on Cloudways • AdFully Managed hosting built for WordPress-powered businesses that need reliable, auto-scalable hosting. Cloudways SafeUpdates now available.
- Benchmarking Attention Mechanism in Vision Transformers.☆20Oct 10, 2022Updated 3 years ago
- Learning to Skip the Middle Layers of Transformers☆17Aug 7, 2025Updated 8 months ago
- Implement FlashAttention v2 with minimal code to learn.☆16Jun 12, 2024Updated last year
- [ACL-26 Findings] Implementation for HiPrune, a training-free visual token pruning method for VLM acceleration.☆52Updated this week
- GEAR: An Efficient KV Cache Compression Recipefor Near-Lossless Generative Inference of LLM☆181Jul 12, 2024Updated last year
- [ICCV2025] The official code of "DreamRelation: Relation-Centric Video Customization"☆28Feb 4, 2026Updated 2 months ago
- Official PyTorch code for ICLR 2025 paper "Gnothi Seauton: Empowering Faithful Self-Interpretability in Black-Box Models"☆23Mar 4, 2025Updated last year
- [AAAI 2025] HiRED strategically drops visual tokens in the image encoding stage to improve inference efficiency for High-Resolution Visio…☆45Apr 18, 2025Updated 11 months ago
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving☆336Jul 2, 2024Updated last year
- 1-Click AI Models by DigitalOcean Gradient • AdDeploy popular AI models on DigitalOcean Gradient GPU virtual machines with just a single click and start building anything your business needs.
- Algorithms for approximate attention in LLMs☆22Apr 14, 2025Updated 11 months ago
- Official Pytorch implementation for LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior (ICLR 2025 Oral).☆101Feb 11, 2025Updated last year
- Tiny optimized Stable-diffusion that can run on GPUs with just 1GB of VRAM. (Beta)☆182Jul 20, 2023Updated 2 years ago
- [IJCV 2025] Paragraph-to-Image Generation with Information-Enriched Diffusion Model☆107Mar 24, 2025Updated last year
- [ICML‘2024] "LoCoCo: Dropping In Convolutions for Long Context Compression", Ruisi Cai, Yuandong Tian, Zhangyang Wang, Beidi Chen☆17Sep 7, 2024Updated last year
- An example project showing how to build a pip-installable Python package that invokes custom CUDA/C++ code☆14Jul 12, 2017Updated 8 years ago
- Streaming Video Diffusion: Online Video Editing with Diffusion Models☆18Jun 3, 2024Updated last year
- This repository implements the paper "Effective Training of Convolutional Neural Networks with Low-bitwidth Weights and Activations"☆20Aug 30, 2021Updated 4 years ago
- ☆47Nov 25, 2024Updated last year
- GPU virtual machines on DigitalOcean Gradient AI • AdGet to production fast with high-performance AMD and NVIDIA GPUs you can spin up in seconds. The definition of operational simplicity.
- [ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache☆381Nov 20, 2025Updated 4 months ago
- (CVPR 2022) Automated Progressive Learning for Efficient Training of Vision Transformers☆25Feb 26, 2025Updated last year
- [ICML 2025] SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models☆55Aug 9, 2024Updated last year
- [EMNLP 2024 Findings🔥] Official implementation of ": LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context In…☆104Nov 9, 2024Updated last year
- Code for Neurips24 paper: QuaRot, an end-to-end 4-bit inference of large language models.☆498Nov 26, 2024Updated last year
- The code for the paper "Efficient Self-Supervised Video Hashing with Selective State Spaces" (AAAI'25).☆23Aug 2, 2025Updated 8 months ago
- ☆12Sep 1, 2023Updated 2 years ago