VITA-Group / Q-Hitter
☆15 · Updated Jun 4, 2024
Alternatives and similar repositories for Q-Hitter
Users interested in Q-Hitter are comparing it to the libraries listed below.
- InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management (OSDI'24) · ☆174 · Updated Jul 10, 2024
- [ISCA'25] LIA: A Single-GPU LLM Inference Acceleration with Cooperative AMX-Enabled CPU-GPU Computation and CXL Offloading · ☆13 · Updated Jun 28, 2025
- This repo contains the source code for: Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs · ☆44 · Updated Aug 14, 2024
- ☆31 · Updated Nov 11, 2024
- ☆19 · Updated Mar 11, 2025
- Code for "FactKB: Generalizable Factuality Evaluation using Language Models Enhanced with Factual Knowledge", EMNLP 2023 · ☆20 · Updated Dec 25, 2023
- ☆14 · Updated Nov 7, 2025
- ☆22 · Updated Mar 7, 2025
- Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization (ISCA'24) · ☆25 · Updated Jul 4, 2024
- [TRETS 2025][FPGA 2024] FPGA Accelerator for Imbalanced SpMV using HLS · ☆20 · Updated Aug 24, 2025
- PyTorch implementation of our paper accepted by ICML 2024 -- CaM: Cache Merging for Memory-efficient LLMs Inference · ☆49 · Updated Jun 19, 2024
- Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of… · ☆148 · Updated Aug 9, 2024
- ViTALiTy (HPCA'23) Code Repository · ☆23 · Updated Mar 13, 2023
- ☆17 · Updated Feb 13, 2021
- ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization · ☆112 · Updated Oct 15, 2024
- Keyformer proposes KV Cache reduction through key token identification, without the need for fine-tuning · ☆58 · Updated Mar 26, 2024
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM · ☆176 · Updated Jul 12, 2024
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference · ☆372 · Updated Jul 10, 2025
- ☆352 · Updated Apr 2, 2024
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference · ☆92 · Updated Jul 17, 2025
- Ok-Topk is a scheme for distributed training with sparse gradients. Ok-Topk integrates a novel sparse allreduce algorithm (less than 6k c… · ☆27 · Updated Dec 10, 2022
- QJL: 1-Bit Quantized JL transform for KV Cache Quantization with Zero Overhead · ☆31 · Updated Jan 27, 2025
- An experimentation platform for LLM inference optimisation · ☆35 · Updated Sep 19, 2024
- Official Repo for "SplitQuant / LLM-PQ: Resource-Efficient LLM Offline Serving on Heterogeneous GPUs via Phase-Aware Model Partition and … · ☆36 · Updated Aug 29, 2025
- ☆13 · Updated Jan 28, 2026
- ☆39 · Updated Aug 27, 2024
- H2-LLM: Hardware-Dataflow Co-Exploration for Heterogeneous Hybrid-Bonding-based Low-Batch LLM Inference · ☆87 · Updated Apr 26, 2025
- Public repository of the UCSC CMPE220 class project · ☆10 · Updated Oct 8, 2017
- A TensorFlow implementation of the paper "Feature Super-Resolution Based Facial Expression Recognition for Multi-scale Low-Resolution Ima… · ☆13 · Updated Nov 30, 2021
- Linux integrity monitoring for CentOS/RHEL · ☆10 · Updated May 13, 2020
- VMSDK implements the Evidence API · ☆11 · Updated Nov 25, 2024
- A simple cycle-accurate DaDianNao simulator · ☆13 · Updated Mar 27, 2019
- A memory allocator that aims to eliminate dangling pointer vulnerabilities at a low overhead, using virtualisation via Dune. My Computer … · ☆10 · Updated Nov 27, 2019
- A simple 8086-CPU simulator using Verilog and Quartus II · ☆10 · Updated Jul 9, 2018
- Code repository for experiments in the SpecROP paper · ☆13 · Updated Sep 3, 2021
- DELTA-pytorch: DELTA: Dynamically Optimizing GPU Memory beyond Tensor Recomputation · ☆12 · Updated Apr 16, 2024
- Implementations of some LLM KV Cache Sparsity methods · ☆41 · Updated Jun 6, 2024
- [ICLR 2025] TidalDecode: A Fast and Accurate LLM Decoding with Position Persistent Sparse Attention · ☆52 · Updated Aug 6, 2025
- ☆132 · Updated Jun 24, 2024