huggingface / optimum-quanto

A PyTorch quantization backend for Optimum

☆1,023

Alternatives and similar repositories for optimum-quanto

Users who are interested in optimum-quanto are comparing it to the libraries listed below.

pytorch / ao
PyTorch native quantization and sparsity for training and inference
☆2,668 · Updated this week
dropbox / hqq
Official implementation of Half-Quadratic Quantization (HQQ)
☆913 · Dec 18, 2025 · Updated last month
bitsandbytes-foundation / bitsandbytes
Accessible large language models via k-bit quantization for PyTorch.
☆7,939 · Jan 22, 2026 · Updated 3 weeks ago
huggingface / optimum
🚀 Accelerate inference and training of 🤗 Transformers, Diffusers, TIMM and Sentence Transformers with easy to use hardware optimization…
☆3,279 · Jan 15, 2026 · Updated 3 weeks ago
nunchaku-ai / deepcompressor
Model Compression Toolbox for Large Language Models and Diffusion Models
☆753 · Aug 14, 2025 · Updated 6 months ago
mit-han-lab / llm-awq
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
☆3,436 · Jul 17, 2025 · Updated 6 months ago
casper-hansen / AutoAWQ
AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation:
☆2,314 · May 11, 2025 · Updated 9 months ago
intel / neural-compressor
SOTA low-bit LLM quantization (INT8/FP8/MXFP8/INT4/MXFP4/NVFP4) & sparsity; leading model compression techniques on PyTorch, TensorFlow, …
☆2,581 · Feb 5, 2026 · Updated last week
AutoGPTQ / AutoGPTQ
An easy-to-use LLMs quantization package with user-friendly apis, based on GPTQ algorithm.
☆5,028 · Apr 11, 2025 · Updated 10 months ago
IST-DASLab / marlin
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes of 16-32 tokens.
☆1,011 · Sep 4, 2024 · Updated last year
huggingface / diffusion-fast
Faster generation with text-to-image diffusion models.
☆230 · Jun 28, 2025 · Updated 7 months ago
mit-han-lab / omniserve
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se…
☆812 · Mar 6, 2025 · Updated 11 months ago
NVIDIA / Model-Optimizer
A unified library of SOTA model optimization techniques like quantization, pruning, distillation, speculative decoding, etc. It compresse…
☆1,964 · Updated this week
meta-pytorch / torchtune
PyTorch native post-training library
☆5,669 · Updated this week
OpenGVLab / OmniQuant
[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
☆887 · Nov 26, 2025 · Updated 2 months ago
NetEase-FuXi / EETQ
Easy and Efficient Quantization for Transformers
☆205 · Jan 28, 2026 · Updated 2 weeks ago
mit-han-lab / smoothquant
[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
☆1,607 · Jul 12, 2024 · Updated last year
vllm-project / compressed-tensors
A safetensors extension to efficiently store sparse quantized tensors on disk
☆244 · Updated this week
Dao-AILab / flash-attention
Fast and memory-efficient exact attention
☆22,231 · Updated this week
huggingface / lighteval
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
☆2,293 · Jan 21, 2026 · Updated 3 weeks ago
Vahe1994 / AQLM
Official Pytorch repository for Extreme Compression of Large Language Models via Additive Quantization https://arxiv.org/pdf/2401.06118.p…
☆1,315 · Aug 8, 2025 · Updated 6 months ago
huggingface / text-generation-inference
Large Language Model Text Generation Inference
☆10,757 · Jan 8, 2026 · Updated last month
vllm-project / llm-compressor
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
☆2,705 · Feb 6, 2026 · Updated last week
wejoncy / QLLM
A general 2-8 bits quantization toolbox with GPTQ/AWQ/HQQ/VPTQ, and export to onnx/onnx-runtime easily.
☆184 · Apr 2, 2025 · Updated 10 months ago
efeslab / Atom
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
☆336 · Jul 2, 2024 · Updated last year
SqueezeAILab / KVQuant
[NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
☆404 · Aug 13, 2024 · Updated last year
huggingface / safetensors
Simple, safe way to store and distribute tensors
☆3,619 · Feb 4, 2026 · Updated last week
sayakpaul / diffusers-torchao
End-to-end recipes for optimizing diffusion models with torchao and diffusers (inference and FP8 training).
☆393 · Jan 8, 2026 · Updated last month
huggingface / accelerate
🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (i…
☆9,491 · Feb 6, 2026 · Updated last week
huggingface / nanotron
Minimalistic large language model 3D-parallelism training
☆2,544 · Dec 11, 2025 · Updated 2 months ago
intel / auto-round
🎯An accuracy-first, highly efficient quantization toolkit for LLMs, designed to minimize quality degradation across Weight-Only Quantiza…
☆845 · Feb 6, 2026 · Updated last week
facebookresearch / xformers
Hackable and optimized Transformers building blocks, supporting a composable construction.
☆10,336 · Feb 5, 2026 · Updated last week
chengzeyi / stable-fast
https://wavespeed.ai/ Best inference performance optimization framework for HuggingFace Diffusers on NVIDIA GPUs.
☆1,299 · Mar 27, 2025 · Updated 10 months ago
flashinfer-ai / flashinfer
FlashInfer: Kernel Library for LLM Serving
☆4,935 · Updated this week
huggingface / peft
🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
☆20,619 · Updated this week
meta-pytorch / float8_experimental
This repository contains the experimental PyTorch native float8 training UX
☆226 · Aug 1, 2024 · Updated last year
spcl / QuaRot
Code for Neurips24 paper: QuaRot, an end-to-end 4-bit inference of large language models.
☆482 · Nov 26, 2024 · Updated last year
NVIDIA / TensorRT-LLM
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizat…
☆12,867 · Updated this week
Lightning-AI / lightning-thunder
PyTorch compiler that accelerates training and inference. Get built-in optimizations for performance, memory, parallelism, and easily wri…
☆1,440 · Feb 3, 2026 · Updated last week