cmd2001 / KVTunerLinks

KVTuner: Sensitivity-Aware Layer-wise Mixed Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference

☆23

Alternatives and similar repositories for KVTuner

Users that are interested in KVTuner are comparing it to the libraries listed below

Sorting:

ziplab / QLLM
[ICLR 2024] This is the official PyTorch implementation of "QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Mod…
☆30Updated last year
duterscmy / CD-MoE
Official PyTorch implementation of CD-MOE
☆12Updated 7 months ago
facebookresearch / Ternary_Binary_Transformer
ACL 2023
☆39Updated 2 years ago
JarvisPei / CMoE
Implementation for the paper: CMoE: Fast Carving of Mixture-of-Experts for Efficient LLM Inference
☆30Updated 8 months ago
ChenMnZ / INT_vs_FP
A framework to compare low-bit integer and float-point formats
☆42Updated 3 weeks ago
pprp / Pruner-Zero
[ICML24] Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for LLMs
☆96Updated 11 months ago
lightmatter-ai / INT-FP-QSim
Flexible simulator for mixed precision and format simulation of LLMs and vision transformers.
☆51Updated 2 years ago
LiqunMa / FBI-LLM
FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation
☆51Updated 2 months ago
ModelTC / QLLM
[ICLR 2024] This is the official PyTorch implementation of "QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Mod…
☆39Updated last year
GATECH-EIC / Linearized-LLM
[ICML 2024] When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models
☆36Updated last year
imagination-research / LCSC
[ICLR 2025] Linear Combination of Saved Checkpoints Makes Consistency and Diffusion Models Better
☆16Updated 9 months ago
ruikangliu / Quantized-Reasoning-Models
[COLM 2025] Official PyTorch implementation of "Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models"
☆59Updated 4 months ago
TianjinYellow / StableSPAM
☆27Updated 7 months ago
rayleizhu / vllm-ra
[ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts
☆40Updated last year
shawnricecake / search-llm
[NeurIPS 2024] Search for Efficient LLMs
☆15Updated 10 months ago
ScalingIntelligence / CATS
☆30Updated last year
fmfi-compbio / admm-pruning
☆30Updated last year
SqueezeAILab / SqueezedAttention
[ACL 2025] Squeezed Attention: Accelerating Long Prompt LLM Inference
☆54Updated last year
Intelligent-Computing-Lab-Panda / TesseraQ
☆23Updated last year
pittisl / ElasticTrainer
Code for paper "ElasticTrainer: Speeding Up On-Device Training with Runtime Elastic Tensor Selection" (MobiSys'23)
☆14Updated 2 years ago
machilusZ / FastGen
This repo contains the source code for: Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs
☆41Updated last year
UNITES-Lab / C2R-MoE
[NAACL'25 🏆 SAC Award] Official code for "Advancing MoE Efficiency: A Collaboration-Constrained Routing (C2R) Strategy for Better Expert…
☆13Updated 9 months ago
HuangOwen / QAT-ACS
[TMLR] Official PyTorch implementation of paper "Efficient Quantization-aware Training with Adaptive Coreset Selection"
☆35Updated last year
metacarbon / shareAtt
Beyond KV Caching: Shared Attention for Efficient LLMs
☆20Updated last year
Qualcomm-AI-research / gptvq
☆37Updated last year
IntelLabs / Hardware-Aware-Automated-Machine-Learning
☆71Updated 3 months ago
zyxxmu / cam
Pytorch implementation of our paper accepted by ICML 2024 -- CaM: Cache Merging for Memory-efficient LLMs Inference
☆47Updated last year
aiha-lab / MX-QLLM
LLM Inference with Microscaling Format
☆32Updated last year
yuzhenmao / IceFormer
Implementation of IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (ICLR 2024).
☆25Updated 4 months ago
IST-DASLab / QIGen
Repository for CPU Kernel Generation for LLM Inference
☆27Updated 2 years ago