intel/neural-compressor

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/intel/neural-compressor)

intel / neural-compressor

SOTA low-bit LLM quantization (INT8/FP8/MXFP8/INT4/MXFP4/NVFP4) & sparsity; leading model compression techniques on PyTorch, TensorFlow, and ONNX Runtime

☆2,666

Alternatives and similar repositories for neural-compressor

Users that are interested in neural-compressor are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

intel / intel-extension-for-transformers
View on GitHub
⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Pl…
☆2,177Oct 8, 2024Updated last year
intel / intel-extension-for-pytorch
View on GitHub
A Python package for extending the official PyTorch that can easily obtain performance on Intel platform
☆2,015Mar 30, 2026Updated 3 months ago
openvinotoolkit / nncf
View on GitHub
Neural Network Compression Framework for enhanced OpenVINO™ inference
☆1,176Updated this week
mit-han-lab / smoothquant
View on GitHub
[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
☆1,662Jul 12, 2024Updated last year
intel / auto-round
View on GitHub
A SOTA quantization algorithm for high-accuracy low-bit LLM inference, seamlessly optimized for CPU/XPU/CUDA, with multi-datatype support…
☆1,486Updated this week
AI Agents on DigitalOcean Gradient AI Platform • Ad
Build production-ready AI agents using customizable tools or access multiple LLMs through a single endpoint. Create custom knowledge bases or connect external data.
IST-DASLab / gptq
View on GitHub
Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".
☆2,327Mar 27, 2024Updated 2 years ago
qualcomm / aimet
View on GitHub
AIMET is a library that provides advanced quantization and compression techniques for trained neural network models.
☆2,647Updated this week
bitsandbytes-foundation / bitsandbytes
View on GitHub
Accessible large language models via k-bit quantization for PyTorch.
☆8,286Jun 22, 2026Updated last week
mit-han-lab / llm-awq
View on GitHub
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
☆3,578Jul 17, 2025Updated 11 months ago
huggingface / optimum
View on GitHub
🚀 Accelerate inference and training of 🤗 Transformers, Diffusers, TIMM and Sentence Transformers with easy to use hardware optimization…
☆3,426Jun 22, 2026Updated last week
NVIDIA / FasterTransformer
View on GitHub
Transformer related optimization, including BERT, GPT
☆6,433Mar 27, 2024Updated 2 years ago
Qualcomm-AI-research / FP8-quantization
View on GitHub
☆173Mar 9, 2023Updated 3 years ago
Efficient-ML / Awesome-Model-Quantization
View on GitHub
A list of papers, docs, codes about model quantization. This repo is aimed to provide the info for model quantization research, we are co…
☆2,396May 11, 2026Updated last month
intel / neural-speed
View on GitHub
An innovative library for efficient LLM inference via low-bit quantization
☆353Aug 30, 2024Updated last year
Deploy to Railway using AI coding agents - Free Credits Offer • Ad
Use Claude Code, Codex, OpenCode, and more. Autonomous software development now has the infrastructure to match with Railway.
huggingface / optimum-quanto
View on GitHub
A pytorch quantization backend for optimum
☆1,044Updated this week
onnx / neural-compressor
View on GitHub
Model compression for ONNX
☆101May 1, 2026Updated last month
OpenGVLab / OmniQuant
View on GitHub
[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
☆899Nov 26, 2025Updated 7 months ago
uxlfoundation / oneDNN
View on GitHub
oneAPI Deep Neural Network Library (oneDNN)
☆4,009Updated this week
pytorch / ao
View on GitHub
PyTorch native quantization and sparsity for training and inference
☆2,875Updated this week
wejoncy / QLLM
View on GitHub
A general 2-8 bits quantization toolbox with GPTQ/AWQ/HQQ/VPTQ, and export to onnx/onnx-runtime easily.
☆190Mar 23, 2026Updated 3 months ago
huggingface / optimum-intel
View on GitHub
🤗 Optimum Intel: Accelerate inference with Intel optimization tools
☆602Updated this week
neuralmagic / deepsparse
View on GitHub
Sparsity-aware deep learning inference runtime for CPUs
☆3,160Jun 2, 2025Updated last year
mit-han-lab / omniserve
View on GitHub
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se…
☆845Mar 6, 2025Updated last year
Serverless GPU API endpoints on Runpod - Get Bonus Credits • Ad
Skip the infrastructure headaches. Auto-scaling, pay-as-you-go, no-ops approach lets you focus on innovating your application.
triton-lang / triton
View on GitHub
Development repository for the Triton language and compiler
☆19,525Updated this week
OpenPPL / ppq
View on GitHub
PPL Quantization Tool (PPQ) is a powerful offline neural network quantization tool.
☆1,803Mar 28, 2024Updated 2 years ago
NVIDIA / TensorRT
View on GitHub
NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source compone…
☆13,102Updated this week
ModelTC / MQBench
View on GitHub
Model Quantization Benchmark
☆868Apr 20, 2025Updated last year
Dao-AILab / flash-attention
View on GitHub
Fast and memory-efficient exact attention
☆24,221Jun 22, 2026Updated last week
SqueezeAILab / SqueezeLLM
View on GitHub
[ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization
☆722Aug 13, 2024Updated last year
facebookincubator / AITemplate
View on GitHub
AITemplate is a Python framework which renders neural network into high performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (N…
☆4,721Apr 9, 2026Updated 2 months ago
intel / ai-reference-models
View on GitHub
Intel® AI Reference Models: contains Intel optimizations for running deep learning workloads on Intel® Xeon® Scalable processors and Inte…
☆732Feb 11, 2026Updated 4 months ago
AutoGPTQ / AutoGPTQ
View on GitHub
An easy-to-use LLMs quantization package with user-friendly apis, based on GPTQ algorithm.
☆5,072Apr 11, 2025Updated last year
Managed hosting for WordPress and PHP on Cloudways • Ad
Managed hosting for WordPress, Magento, Laravel, or PHP apps, on multiple cloud providers. Deploy in minutes on Cloudways by DigitalOcean.
IST-DASLab / marlin
View on GitHub
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes of 16-32 tokens.
☆1,095Sep 4, 2024Updated last year
NVIDIA / TransformerEngine
View on GitHub
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit floating point (FP8 and FP4) precision on H…
☆3,408Updated this week
NVIDIA / TensorRT-LLM
View on GitHub
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizat…
☆13,994Updated this week
vllm-project / llm-compressor
View on GitHub
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
☆3,439Updated this week
casper-hansen / AutoAWQ
View on GitHub
AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation:
☆2,349May 11, 2025Updated last year
deepspeedai / DeepSpeed
View on GitHub
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
☆42,586Updated this week
VainF / Torch-Pruning
View on GitHub
[CVPR 2023] DepGraph: Towards Any Structural Pruning; LLMs, Vision Foundation Models, etc.
☆3,322Sep 7, 2025Updated 9 months ago