AniZpZ/AutoSmoothQuant

An easy-to-use package for implementing SmoothQuant for LLMs

☆111

Alternatives and similar repositories for AutoSmoothQuant

Users interested in AutoSmoothQuant are comparing it to the libraries listed below.

HandH1998 / QQQ
QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs.
☆157 · Aug 21, 2025 · Updated 10 months ago
Adlik / smoothquantplus
[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
☆23 · Mar 15, 2024 · Updated 2 years ago
IST-DASLab / marlin
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes of 16-32 tokens.
☆1,109 · Sep 4, 2024 · Updated last year
efeslab / Atom
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
☆343 · Jul 2, 2024 · Updated 2 years ago
JingyangXiang / DFRot
[COLM 2025] DFRot: Achieving Outlier-Free and Massive Activation-Free for Rotated LLMs with Refined Rotation; Zhihu: https://zhuanlan.zhihu.c…
☆30 · Mar 5, 2025 · Updated last year
mit-han-lab / omniserve
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se…
☆850 · Mar 6, 2025 · Updated last year
OpenGVLab / OmniQuant
[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
☆901 · Nov 26, 2025 · Updated 7 months ago
flashinfer-ai / cutlass-viz
☆65 · Apr 26, 2025 · Updated last year
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆96 · Sep 4, 2024 · Updated last year
mit-han-lab / smoothquant
[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
☆1,669 · Jul 12, 2024 · Updated 2 years ago
ChenMnZ / PrefixQuant
An algorithm for weight-activation quantization (W4A4, W4A8) of LLMs, supporting both static and dynamic quantization
☆176 · Nov 26, 2025 · Updated 7 months ago
spcl / QuaRot
Code for Neurips24 paper: QuaRot, an end-to-end 4-bit inference of large language models.
☆523 · Nov 26, 2024 · Updated last year
usyd-fsalab / fp6_llm
An efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5).
☆282 · Jul 16, 2025 · Updated last year
AniZpZ / smoothquant
[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
☆11 · Dec 13, 2023 · Updated 2 years ago
jy-yuan / KIVI
[ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
☆418 · Nov 20, 2025 · Updated 8 months ago
neuralmagic / nm-vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
☆267 · Dec 4, 2025 · Updated 7 months ago
IST-DASLab / gemm-fp8
High Performance FP8 GEMM Kernels for SM89 and later GPUs.
☆21 · Jan 24, 2025 · Updated last year
sgl-project / sgl-flash-attn
Fast and memory-efficient exact attention
☆22 · Jun 26, 2026 · Updated 3 weeks ago
microsoft / chunk-attention
☆89 · Apr 18, 2025 · Updated last year
ByteDance-Seed / decoupleQ
A quantization algorithm for LLM
☆150 · Jun 21, 2024 · Updated 2 years ago
feifeibear / ChituAttention
Quantized Attention on GPU
☆45 · Nov 22, 2024 · Updated last year
opengear-project / GEAR
GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
☆183 · Jul 12, 2024 · Updated 2 years ago
Raincleared-Song / sparse_gpu_operator
GPU operators for sparse tensor operations
☆37 · Mar 11, 2024 · Updated 2 years ago
SqueezeAILab / KVQuant
[NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
☆430 · Aug 13, 2024 · Updated last year
TiledTensor / TiledBench
Benchmark tests supporting the TiledCUDA library.
☆19 · Nov 19, 2024 · Updated last year
Infini-AI-Lab / TriForce
[COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
☆281 · Aug 31, 2024 · Updated last year
IST-DASLab / QUIK
Repository for the QUIK project, enabling the use of 4bit kernels for generative inference - EMNLP 2024
☆185 · Apr 16, 2024 · Updated 2 years ago
IST-DASLab / qutlass
QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning
☆191 · Updated this week
LeiWang1999 / AutoGPTQ.tvm
GPTQ inference TVM kernel
☆41 · Apr 25, 2024 · Updated 2 years ago
mit-han-lab / llm-awq
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
☆3,591 · Jul 17, 2025 · Updated last year
nunchaku-ai / deepcompressor
Model Compression Toolbox for Large Language Models and Diffusion Models
☆794 · Aug 14, 2025 · Updated 11 months ago
neuralmagic / AutoFP8
☆210 · May 5, 2025 · Updated last year
Intelligent-Computing-Lab-Panda / GPTAQ
Code implementation of GPTAQ (https://arxiv.org/abs/2504.02692)
☆92 · Jul 28, 2025 · Updated 11 months ago
Bruce-Lee-LY / cutlass_gemm
Multiple GEMM operators are constructed with cutlass to support LLM inference.
☆20 · Aug 3, 2025 · Updated 11 months ago
xlite-dev / ffpa-attn
🤖FFPA: Extends FA-2/3 via Split-D for large headdims, 1.5x~6×↑🎉 vs SDPA, up to 513~535 TFLOPS🎉 on NVIDIA H200.
☆315 · Updated this week
BBuf / tensorrt-llm-moe
☆34 · Feb 3, 2025 · Updated last year
tigert1998 / qat
Manually implemented quantization-aware training
☆22 · Oct 12, 2022 · Updated 3 years ago
tlc-pack / libflash_attn
Standalone Flash Attention v2 kernel without libtorch dependency
☆113 · Sep 10, 2024 · Updated last year
UCLA-VAST / heterohalide
HeteroHalide: From Image Processing DSL to Efficient FPGA Acceleration
☆15 · Sep 14, 2020 · Updated 5 years ago