intel / auto-round

Advanced Quantization Algorithm for LLMs and VLMs, with support for CPU, Intel GPU, CUDA and HPU.

☆668

Alternatives and similar repositories for auto-round

Users who are interested in auto-round are comparing it to the libraries listed below.

mobiusml / hqq
Official implementation of Half-Quadratic Quantization (HQQ)
☆883 Updated last month
microsoft / VPTQ
VPTQ, A Flexible and Extreme low-bit quantization algorithm
☆659 Updated 5 months ago
ModelCloud / GPTQModel
LLM model quantization (compression) toolkit with hw acceleration support for Nvidia CUDA, AMD ROCm, Intel XPU and Intel/AMD/Apple CPU vi…
☆828 Updated last week
microsoft / MInference
[NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up Long-context LLMs' inference, approximate and dynamic sparse calculate the attention…
☆1,141 Updated 3 weeks ago
intel / neural-speed
An innovative library for efficient LLM inference via low-bit quantization
☆349 Updated last year
NVlabs / Minitron
A family of compressed models obtained via pruning and knowledge distillation
☆352 Updated 11 months ago
IST-DASLab / marlin
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
☆916 Updated last year
mit-han-lab / omniserve
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se…
☆766 Updated 7 months ago
OpenGVLab / EfficientQAT
[ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models
☆306 Updated 5 months ago
microsoft / TransformerCompression
For releasing code related to compression methods for transformers, accompanying our publications
☆446 Updated 9 months ago
NVIDIA / Star-Attention
Efficient LLM Inference over Long Sequences
☆390 Updated 3 months ago
efeslab / Nanoflow
A throughput-oriented high-performance serving framework for LLMs
☆904 Updated last month
Cornell-RelaxML / quip-sharp
☆559 Updated 11 months ago
vllm-project / compressed-tensors
A safetensors extension to efficiently store sparse quantized tensors on disk
☆171 Updated last week
neuralmagic / nm-vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
☆266 Updated last year
vllm-project / llm-compressor
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
☆2,106 Updated this week
OpenGVLab / OmniQuant
[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
☆859 Updated 5 months ago
huggingface / optimum-quanto
A PyTorch quantization backend for Optimum
☆995 Updated last week
NVIDIA / kvpress
LLM KV cache compression made easy
☆660 Updated this week
spcl / QuaRot
Code for the NeurIPS'24 paper: QuaRot, end-to-end 4-bit inference of large language models.
☆433 Updated 10 months ago
LeanModels / DFloat11
DFloat11: Lossless LLM Compression for Efficient GPU Inference
☆550 Updated last month
facebookresearch / SpinQuant
Code repo for the paper "SpinQuant: LLM quantization with learned rotations"
☆335 Updated 8 months ago
huggingface / optimum-benchmark
🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of O…
☆318 Updated 3 weeks ago
VITA-Group / Q-GaLore
Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients.
☆202 Updated last year
neuralmagic / AutoFP8
☆205 Updated 5 months ago
wejoncy / QLLM
A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ/VPTQ, with easy export to ONNX/ONNX Runtime.
☆180 Updated 6 months ago
microsoft / T-MAC
Low-bit LLM inference on CPU/NPU with lookup table
☆871 Updated 4 months ago
Cornell-RelaxML / qtip
☆152 Updated 4 months ago
facebookresearch / LayerSkip
Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024
☆343 Updated 5 months ago
mit-han-lab / duo-attention
[ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
☆494 Updated 8 months ago