wejoncy / QLLM

A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ/VPTQ support and easy export to ONNX/ONNX Runtime.

☆175

Alternatives and similar repositories for QLLM

Users who are interested in QLLM compare it to the libraries listed below.

hahnyuan / RPTQ4LLM
Reorder-based post-training quantization for large language models
☆193 Updated 2 years ago
neuralmagic / AutoFP8
☆195 Updated 2 months ago
HandH1998 / QQQ
QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs.
☆137 Updated 3 months ago
AniZpZ / AutoSmoothQuant
An easy-to-use package for implementing SmoothQuant for LLMs
☆102 Updated 3 months ago
nbasyl / LLM-FP4
The official implementation of the EMNLP 2023 paper LLM-FP4
☆211 Updated last year
fpgaminer / GPTQ-triton
GPTQ inference Triton kernel
☆303 Updated 2 years ago
OpenGVLab / EfficientQAT
[ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models
☆288 Updated 2 months ago
Adlik / smoothquantplus
[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
☆23 Updated last year
SqueezeAILab / KVQuant
[NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
☆365 Updated 11 months ago
facebookresearch / LLM-QAT
Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models"
☆305 Updated 5 months ago
casper-hansen / AutoAWQ_kernels
☆76 Updated 8 months ago
NetEase-FuXi / EETQ
Easy and Efficient Quantization for Transformers
☆198 Updated last month
neuralmagic / nm-vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
☆265 Updated 9 months ago
neuralmagic / compressed-tensors
A safetensors extension to efficiently store sparse quantized tensors on disk
☆141 Updated this week
efeslab / Atom
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
☆318 Updated last year
mlc-ai / llm-perf-bench
☆120 Updated last year
ByteDance-Seed / decoupleQ
A quantization algorithm for LLMs
☆141 Updated last year
DD-DuDa / BitDistiller
[ACL 2024] A novel QAT with Self-Distillation framework to enhance ultra low-bit LLMs.
☆117 Updated last year
Macaronlin / LLaMA3-Quantization
A repository dedicated to evaluating the performance of quantized LLaMA3 using various quantization methods.
☆192 Updated 6 months ago
haochengxi / Train_Transformers_with_INT4
☆153 Updated 2 years ago
usyd-fsalab / fp6_llm
Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5).
☆260 Updated 2 weeks ago
kyegomez / FlashAttention20
Get down and dirty with FlashAttention2.0 in PyTorch: plug and play, no complex CUDA kernels
☆106 Updated 2 years ago
microsoft / TransformerCompression
For releasing code related to compression methods for transformers, accompanying our publications
☆437 Updated 6 months ago
hahnyuan / PB-LLM
PB-LLM: Partially Binarized Large Language Models
☆153 Updated last year
Aaronhuang-778 / BiLLM
[ICML 2024] BiLLM: Pushing the Limit of Post-Training Quantization for LLMs
☆221 Updated 6 months ago
inferflow / inferflow
Inferflow is an efficient and highly configurable inference engine for large language models (LLMs).
☆244 Updated last year
SqueezeBits / QUICK
QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference
☆118 Updated last year
ruikangliu / FlatQuant
[ICML 2025] Official PyTorch implementation of "FlatQuant: Flatness Matters for LLM Quantization"
☆149 Updated 2 weeks ago
IsaacRe / vllm-kvcompress
KV cache compression for high-throughput LLM inference
☆134 Updated 5 months ago
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆80 Updated 11 months ago