Residual vector quantization for KV cache compression in large language models
☆12Oct 22, 2024Updated last year
Alternatives and similar repositories for vqllm
Users who are interested in vqllm are comparing it to the libraries listed below.
- [SIGMOD 2025] PQCache: Product Quantization-based KVCache for Long Context LLM Inference☆83Dec 7, 2025Updated 3 months ago
- ☆17Jul 24, 2023Updated 2 years ago
- Learning Accurate Decision Trees with Bandit Feedback via Quantized Gradient Descent☆16Sep 8, 2022Updated 3 years ago
- [ACL 2024 Findings] Light-PEFT: Lightening Parameter-Efficient Fine-Tuning via Early Pruning☆13Sep 2, 2024Updated last year
- The official implementation of the DAC 2024 paper GQA-LUT☆21Dec 20, 2024Updated last year
- ☆25Oct 31, 2024Updated last year
- LLM Inference with Microscaling Format☆34Nov 12, 2024Updated last year
- ☆20Sep 28, 2024Updated last year
- Beyond KV Caching: Shared Attention for Efficient LLMs☆20Jul 19, 2024Updated last year
- [TVLSI 2025] ACiM Inference Simulation Framework in "ASiM: Modeling and Analyzing Inference Accuracy of SRAM-Based Analog CiM Circuits"☆27Sep 9, 2025Updated 6 months ago
- Implementation of the network model from "Multiway Attention Networks for Modeling Sentence Pairs"; usable for question answering and sentence-level logical inference☆11Apr 13, 2020Updated 5 years ago
- Official implementation of "TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization" (Findings of ACL …☆21Jul 25, 2025Updated 8 months ago
- ☆10Sep 26, 2024Updated last year
- This repository presents the source code for the paper "MILLION: Mastering Long-Context LLM Inference Via Outlier-Immunized KV Product Qu…☆23Apr 2, 2025Updated 11 months ago
- A fully featured recommender system service built on Surprise, with a simple API exposed through the Flask framework☆11Jan 20, 2021Updated 5 years ago
- The official implementation of BiViT: Extremely Compressed Binary Vision Transformers☆16Jun 18, 2023Updated 2 years ago
- This was done as a part of the coursework for EE604A (Image Processing) at IIT Kanpur. MATLAB implementation of the inpainting algorithm …☆12Nov 12, 2017Updated 8 years ago
- EECS 151/251A FPGA Project Skeleton for Spring 2020☆12May 6, 2020Updated 5 years ago
- Tableau-based reasoner for ALCQ description logic☆13May 1, 2020Updated 5 years ago
- ☆166Jun 22, 2025Updated 9 months ago
- Source code of the paper "Hiring Now: A Skill-Aware Multi-Attention Model for Job Posting Generation, ACL2020"☆10May 26, 2020Updated 5 years ago
- ☆20Nov 12, 2025Updated 4 months ago
- a Computing In Memory emULATOR framework☆15May 19, 2024Updated last year
- [EMNLP 2024] RoLoRA: Fine-tuning Rotated Outlier-free LLMs for Effective Weight-Activation Quantization☆38Sep 24, 2024Updated last year
- [CVPR 2025] LION-FS: Fast & Slow Video-Language Thinker as Online Video Assistant☆27Dec 2, 2025Updated 3 months ago
- [HPCA 2026] A GPU-optimized system for efficient long-context LLMs decoding with low-bit KV cache.☆82Dec 18, 2025Updated 3 months ago
- An individual project related to denoising for event camera☆19Feb 25, 2024Updated 2 years ago
- ☆18Jul 13, 2019Updated 6 years ago
- Official Implementation of SEA: Sparse Linear Attention with Estimated Attention Mask (ICLR 2024)☆11Jun 20, 2025Updated 9 months ago
- The code repository of "MBQ: Modality-Balanced Quantization for Large Vision-Language Models"☆83Mar 17, 2025Updated last year
- Code for the ACL'18 paper: A Neural Approach to Pun Generation☆18Jan 13, 2020Updated 6 years ago
- a xv6 GUI.☆12Jan 19, 2016Updated 10 years ago
- [COLM 2024] SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models☆24Oct 5, 2024Updated last year
- Benchmarking general decision-making with open & random worlds☆20Mar 20, 2026Updated last week
- ☆15Jan 12, 2026Updated 2 months ago
- ☆20Jul 7, 2017Updated 8 years ago
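- Homework for the Fall 2020 Pattern Recognition course at the University of Chinese Academy of Sciences (instructors: Liu Chenglin, Xiang Shiming, Zhang Xuyao)☆10Feb 3, 2021Updated 5 years ago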
- [ICLR2025 Spotlight] MagicPIG: LSH Sampling for Efficient LLM Generation☆251Dec 16, 2024Updated last year
- Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs.☆32Apr 2, 2025Updated 11 months ago