Residual vector quantization for KV cache compression in large language models
☆11 · Oct 22, 2024 · Updated last year
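The idea named in the description above works in stages: each codebook quantizes the residual left by the previous stage, so a stack of small codebooks approximates one much larger effective codebook. A minimal sketch of that encode/decode loop, assuming toy codebook shapes and function names of my own choosing (this is illustrative only, not the vqllm implementation):

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual VQ: at each stage, pick the nearest codeword to the
    current residual, then subtract it. (Illustrative sketch only.)"""
    codes, residual = [], x.astype(np.float64)
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes, residual  # residual is the final quantization error

def rvq_decode(codes, codebooks):
    """Reconstruction is the sum of the chosen codewords."""
    return sum(cb[i] for cb, i in zip(codebooks, codes))

# Toy example: two stages, each with a 4-entry codebook in 2-D.
rng = np.random.default_rng(0)
cbs = [rng.normal(size=(4, 2)) for _ in range(2)]
x = np.array([0.5, -0.3])
codes, res = rvq_encode(x, cbs)
x_hat = rvq_decode(codes, cbs)
```

With `k` stages of `n`-entry codebooks, storage per vector is `k` small indices instead of one index into an `n**k`-entry codebook, which is what makes the scheme attractive for compressing KV cache entries.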
Alternatives and similar repositories for vqllm
Users interested in vqllm are comparing it to the repositories listed below.
- Learning Accurate Decision Trees with Bandit Feedback via Quantized Gradient Descent ☆17 · Sep 8, 2022 · Updated 3 years ago
- ☆17 · Jul 24, 2023 · Updated 2 years ago
- The official implementation of the DAC 2024 paper GQA-LUT ☆20 · Dec 20, 2024 · Updated last year
- [SIGMOD 2025] PQCache: Product Quantization-based KVCache for Long Context LLM Inference ☆83 · Dec 7, 2025 · Updated 3 months ago
- ☆20 · Nov 12, 2025 · Updated 3 months ago
- ☆20 · Sep 28, 2024 · Updated last year
- ☆25 · Oct 31, 2024 · Updated last year
- Beyond KV Caching: Shared Attention for Efficient LLMs ☆20 · Jul 19, 2024 · Updated last year
- [COLM 2024] SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models ☆24 · Oct 5, 2024 · Updated last year
- ☆20 · Jul 7, 2017 · Updated 8 years ago
- LLM Inference with Microscaling Format ☆34 · Nov 12, 2024 · Updated last year
- Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs. ☆32 · Apr 2, 2025 · Updated 11 months ago
- QJL: 1-Bit Quantized JL transform for KV Cache Quantization with Zero Overhead ☆32 · Jan 27, 2025 · Updated last year
- ☆42 · Mar 28, 2024 · Updated last year
- ☆36 · Dec 12, 2023 · Updated 2 years ago
- ☆38 · Mar 14, 2024 · Updated last year
- Kinematic and dynamic models of continuum and articulated soft robots. ☆15 · Nov 22, 2025 · Updated 3 months ago
- [EMNLP 2024] RoLoRA: Fine-tuning Rotated Outlier-free LLMs for Effective Weight-Activation Quantization ☆38 · Sep 24, 2024 · Updated last year
- ☆165 · Jun 22, 2025 · Updated 8 months ago
- PIM-DL: Expanding the Applicability of Commodity DRAM-PIMs for Deep Learning via Algorithm-System Co-Optimization ☆36 · Feb 21, 2024 · Updated 2 years ago
- MATLAB function to fill an area with hatching ~~or speckling~~ ☆11 · Mar 4, 2018 · Updated 8 years ago
- ☆14 · Apr 14, 2025 · Updated 10 months ago
- An artificial matrix generator in C ☆12 · Feb 16, 2023 · Updated 3 years ago
- [ACL 2025] Squeezed Attention: Accelerating Long Prompt LLM Inference ☆57 · Nov 20, 2024 · Updated last year
- Notes and examples to get started with Parallel Computing in CUDA ☆13 · Nov 1, 2019 · Updated 6 years ago
- Code for the paper "Faster Neural Network Training with Approximate Tensor Operations" ☆10 · Oct 23, 2021 · Updated 4 years ago
- Continuous Pipelined Speculative Decoding ☆16 · Jan 4, 2026 · Updated 2 months ago
- BERT Sentiment Classification on the IMDb Large Movie Review Dataset ☆16 · Sep 8, 2022 · Updated 3 years ago
- [ICLR 2025] TidalDecode: A Fast and Accurate LLM Decoding with Position Persistent Sparse Attention ☆53 · Aug 6, 2025 · Updated 7 months ago
- Musings in GEMM (General Matrix Multiplication) ☆14 · Dec 14, 2025 · Updated 2 months ago
- This repo contains the official code release of the Neural Experts paper, published in NeurIPS 2024. ☆14 · Dec 3, 2024 · Updated last year
- CoMeT is a new low-cost RowHammer mitigation that uses Count-Min Sketch-based aggressor row tracking, as described in our HPCA'24 paper h… ☆11 · Jan 23, 2026 · Updated last month
- FPGA-based HyperLogLog Accelerator ☆12 · Jul 13, 2020 · Updated 5 years ago
- [COLM 2025: 1st Workshop on the Application of LLM Explainability to Reasoning and Planning] Latent Chain-of-Thought? Decoding the Depth-… ☆17 · Oct 4, 2025 · Updated 5 months ago
- Locality sensitive hash functions for Tensorflow 2.0 ☆12 · Feb 18, 2022 · Updated 4 years ago
- This repository is outdated and the related functionality has been migrated to https://github.com/easysoc/easysoc-firrtl ☆11 · Nov 3, 2021 · Updated 4 years ago
- A merged-read deduplication tool for single-end data ☆12 · Sep 4, 2024 · Updated last year
- 4-bit Shampoo for Memory-Efficient Network Training (NeurIPS 2024) ☆13 · Feb 13, 2025 · Updated last year
- Proof of Concept to learn Amaranth as an entry effort for Supercon's RTL design competition ☆10 · Nov 11, 2022 · Updated 3 years ago