HuangOwen/Awesome-LLM-Compression

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/HuangOwen/Awesome-LLM-Compression)

HuangOwen / Awesome-LLM-Compression

Awesome LLM compression research papers and tools.

☆1,854

Alternatives and similar repositories for Awesome-LLM-Compression

Users that are interested in Awesome-LLM-Compression are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

horseee / Awesome-Efficient-LLM
View on GitHub
A curated list for Efficient Large Language Models
☆2,026Jun 17, 2025Updated last year
xlite-dev / Awesome-LLM-Inference
View on GitHub
📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉
☆5,424Updated this week
spcl / QuaRot
View on GitHub
Code for Neurips24 paper: QuaRot, an end-to-end 4-bit inference of large language models.
☆524Nov 26, 2024Updated last year
AIoT-MLSys-Lab / Efficient-LLMs-Survey
View on GitHub
[TMLR 2024] Efficient Large Language Models: A Survey
☆1,259Jun 23, 2025Updated last year
October2001 / Awesome-KV-Cache-Compression
View on GitHub
📰 Must-read papers on KV Cache Compression (constantly updating 🤗).
☆727Apr 15, 2026Updated 3 months ago
Simple, predictable pricing with DigitalOcean hosting • Ad
Always know what you'll pay with monthly caps and flat pricing. Enterprise-grade infrastructure trusted by 600k+ customers.
mit-han-lab / smoothquant
View on GitHub
[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
☆1,674Jul 12, 2024Updated 2 years ago
ModelTC / LightCompress
View on GitHub
[EMNLP 2024 & AAAI 2026] A powerful toolkit for compressing large models including LLMs, VLMs, and video generative models.
☆736May 14, 2026Updated 2 months ago
AI-Efficiency / Awesome-Model-Quantization
View on GitHub
A list of papers, docs, codes about model quantization. This repo is aimed to provide the info for model quantization research, we are co…
☆2,414Jul 10, 2026Updated 2 weeks ago
efeslab / Atom
View on GitHub
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
☆344Jul 2, 2024Updated 2 years ago
jy-yuan / KIVI
View on GitHub
[ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
☆423Nov 20, 2025Updated 8 months ago
ChenMnZ / PrefixQuant
View on GitHub
An algorithm for weight-activation quantization (W4A4, W4A8) of LLMs, supporting both static and dynamic quantization
☆176Nov 26, 2025Updated 8 months ago
mit-han-lab / omniserve
View on GitHub
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se…
☆852Mar 6, 2025Updated last year
ruikangliu / FlatQuant
View on GitHub
[ICML 2025] Official PyTorch implementation of "FlatQuant: Flatness Matters for LLM Quantization"
☆223Nov 25, 2025Updated 8 months ago
locuslab / wanda
View on GitHub
A simple and effective LLM pruning approach.
☆868Aug 9, 2024Updated last year
Managed Database hosting by DigitalOcean • Ad
PostgreSQL, MySQL, MongoDB, Kafka, Valkey, and OpenSearch available. Automatically scale up storage and focus on building your apps.
horseee / LLM-Pruner
View on GitHub
[NeurIPS 2023] LLM-Pruner: On the Structural Pruning of Large Language Models. Support Llama-3/3.1, Llama-2, LLaMA, BLOOM, Vicuna, Baich…
☆1,133Oct 7, 2024Updated last year
OpenGVLab / OmniQuant
View on GitHub
[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
☆903Nov 26, 2025Updated 8 months ago
mit-han-lab / llm-awq
View on GitHub
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
☆3,600Jul 17, 2025Updated last year
hemingkx / SpeculativeDecodingPapers
View on GitHub
📰 Must-read papers and blogs on Speculative Decoding ⚡️
☆1,283Jun 27, 2026Updated last month
Zhen-Dong / Awesome-Quantization-Papers
View on GitHub
List of papers related to neural network quantization in recent AI conferences and journals.
☆838Mar 27, 2025Updated last year
IST-DASLab / gptq
View on GitHub
Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".
☆2,342Mar 27, 2024Updated 2 years ago
SqueezeAILab / SqueezeLLM
View on GitHub
[ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization
☆722Aug 13, 2024Updated last year
SqueezeAILab / KVQuant
View on GitHub
[NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
☆431Aug 13, 2024Updated last year
FMInference / H2O
View on GitHub
[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.
☆530Aug 1, 2024Updated last year
Deploy on Railway without the complexity - Free Credits Offer • Ad
Connect your repo and Railway handles the rest with instant previews. Quickly provision container image services, databases, and storage volumes.
pprp / Awesome-LLM-Prune
View on GitHub
Awesome list for LLM pruning.
☆297Oct 11, 2025Updated 9 months ago
facebookresearch / SpinQuant
View on GitHub
Code repo for the paper "SpinQuant LLM quantization with learned rotations"
☆418Feb 14, 2025Updated last year
microsoft / MInference
View on GitHub
[NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up Long-context LLMs' inference, approximate and dynamic sparse calculate the attention…
☆1,223Apr 8, 2026Updated 3 months ago
microsoft / BitBLAS
View on GitHub
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
☆769Aug 6, 2025Updated 11 months ago
IST-DASLab / sparsegpt
View on GitHub
Code for the ICML 2023 paper "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot".
☆891Aug 20, 2024Updated last year
Xnhyacinth / Awesome-LLM-Long-Context-Modeling
View on GitHub
📰 Must-read papers and blogs on LLM based Long Context Modeling 🔥
☆2,147Updated this week
AI-Efficiency / Awesome-Efficient-AIGC
View on GitHub
A list of papers, docs, codes about efficient AIGC. This repo is aimed to provide the info for efficient AIGC research, including languag…
☆206Feb 10, 2025Updated last year
Dao-AILab / fast-hadamard-transform
View on GitHub
Fast Hadamard transform in CUDA, with a PyTorch interface
☆342Mar 10, 2026Updated 4 months ago
HuangOwen / RoLoRA
View on GitHub
[EMNLP 2024] RoLoRA: Fine-tuning Rotated Outlier-free LLMs for Effective Weight-Activation Quantization
☆41Sep 24, 2024Updated last year
Wordpress hosting with auto-scaling - Free Trial Offer • Ad
Fully Managed hosting for WordPress and WooCommerce businesses that need reliable, auto-scalable performance. Cloudways SafeUpdates now available.
mit-han-lab / Quest
View on GitHub
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
☆402Jul 10, 2025Updated last year
nunchaku-ai / deepcompressor
View on GitHub
Model Compression Toolbox for Large Language Models and Diffusion Models
☆796Aug 14, 2025Updated 11 months ago
IST-DASLab / marlin
View on GitHub
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes of 16-32 tokens.
☆1,111Sep 4, 2024Updated last year
facebookresearch / LLM-QAT
View on GitHub
Code repo for the paper "LLM-QAT Data-Free Quantization Aware Training for Large Language Models"
☆327Mar 4, 2025Updated last year
EleutherAI / lm-evaluation-harness
View on GitHub
A framework for few-shot evaluation of language models.
☆13,443Jul 13, 2026Updated 2 weeks ago
opengear-project / GEAR
View on GitHub
GEAR: An Efficient KV Cache Compression Recipefor Near-Lossless Generative Inference of LLM
☆184Jul 12, 2024Updated 2 years ago
microsoft / TransformerCompression
View on GitHub
For releasing code related to compression methods for transformers, accompanying our publications
☆462Jan 16, 2025Updated last year