Adlik / smoothquantplusLinks

[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

☆23

Alternatives and similar repositories for smoothquantplus

Users that are interested in smoothquantplus are comparing it to the libraries listed below

Sorting:

AniZpZ / AutoSmoothQuant
An easy-to-use package for implementing SmoothQuant for LLMs
☆107Updated 6 months ago
HandH1998 / QQQ
QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs.
☆144Updated 2 months ago
wejoncy / QLLM
A general 2-8 bits quantization toolbox with GPTQ/AWQ/HQQ/VPTQ, and export to onnx/onnx-runtime easily.
☆180Updated 6 months ago
DD-DuDa / BitDistiller
[ACL 2024] A novel QAT with Self-Distillation framework to enhance ultra low-bit LLMs.
☆122Updated last year
ModelTC / awesome-lm-system
Summary of system papers/frameworks/codes/tools on training or serving large model
☆57Updated last year
zhangsichengsjtu / AFPQ
AFPQ code implementation
☆23Updated last year
ByteDance-Seed / decoupleQ
A quantization algorithm for LLM
☆143Updated last year
Intelligent-Computing-Lab-Panda / GPTAQ
Code implementation of GPTAQ (https://arxiv.org/abs/2504.02692)
☆67Updated 2 months ago
ChenMnZ / PrefixQuant
An algorithm for weight-activation quantization (W4A4, W4A8) of LLMs, supporting both static and dynamic quantization
☆160Updated 5 months ago
nbasyl / LLM-FP4
The official implementation of the EMNLP 2023 paper LLM-FP4
☆217Updated last year
INT-FlashAttention2024 / INT-FlashAttention
☆82Updated 9 months ago
efeslab / Atom
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
☆320Updated last year
ClubieDong / QAQ-KVCacheQuantization
QAQ: Quality Adaptive Quantization for LLM KV Cache
☆52Updated last year
ruikangliu / FlatQuant
[ICML 2025] Official PyTorch implementation of "FlatQuant: Flatness Matters for LLM Quantization"
☆179Updated last week
casper-hansen / AutoAWQ_kernels
☆78Updated 10 months ago
SqueezeBits / QUICK
QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference
☆118Updated last year
snu-mllab / GuidedQuant
Official PyTorch implementation of "GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance" (ICML 2025)
☆45Updated 3 months ago
feifeibear / Odysseus-Transformer
Odysseus: Playground of LLM Sequence Parallelism
☆78Updated last year
Equationliu / Kangaroo
[NeurIPS 2024] The official implementation of "Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exitin…
☆60Updated last year
hahnyuan / RPTQ4LLM
Reorder-based post-training quantization for large language model
☆194Updated 2 years ago
microsoft / chunk-attention
☆78Updated 6 months ago
ankan-ban / llama_cu_awq
llama INT4 cuda inference with AWQ
☆55Updated 9 months ago
bytedance / AffineQuant
Official implementation of the ICLR 2024 paper AffineQuant
☆27Updated last year
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆83Updated last year
Bruce-Lee-LY / flash_attention_inference
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
☆41Updated 7 months ago
InternLM / turbomind
☆97Updated 7 months ago
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆125Updated 4 months ago
ModelTC / Outlier_Suppression_Plus
Official implementation of the EMNLP23 paper: Outlier Suppression+: Accurate quantization of large language models by equivalent and opti…
☆47Updated 2 years ago
linxihui / dkernel
☆20Updated 6 months ago
opengear-project / GEAR
GEAR: An Efficient KV Cache Compression Recipefor Near-Lossless Generative Inference of LLM
☆169Updated last year