pytorch / ao

PyTorch native quantization and sparsity for training and inference

☆2,438

Alternatives and similar repositories for ao

Users who are interested in ao are comparing it to the libraries listed below.

huggingface / optimum-quanto
A pytorch quantization backend for optimum
☆995 Updated last week
Lightning-AI / lightning-thunder
PyTorch compiler that accelerates training and inference. Get built-in optimizations for performance, memory, parallelism, and easily wri…
☆1,415 Updated this week
NVIDIA / TensorRT-Model-Optimizer
A unified library of state-of-the-art model optimization techniques like quantization, pruning, distillation, speculative decoding, etc. …
☆1,443 Updated last week
pytorch / torchtitan
A PyTorch native platform for training generative AI models
☆4,561 Updated this week
NVIDIA / TransformerEngine
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper, Ada and Bla…
☆2,834 Updated this week
HazyResearch / ThunderKittens
Tile primitives for speedy kernels
☆2,821 Updated last week
huggingface / nanotron
Minimalistic large language model 3D-parallelism training
☆2,267 Updated last month
meta-pytorch / attention-gym
Helpful tools and examples for working with flex-attention
☆1,020 Updated this week
flashinfer-ai / flashinfer
FlashInfer: Kernel Library for LLM Serving
☆3,911 Updated last week
mobiusml / hqq
Official implementation of Half-Quadratic Quantization (HQQ)
☆883 Updated last month
srush / Triton-Puzzles
Puzzles for learning Triton
☆2,036 Updated 11 months ago
fla-org / flash-linear-attention
🚀 Efficient implementations of state-of-the-art linear attention models
☆3,517 Updated this week
vllm-project / llm-compressor
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
☆2,106 Updated this week
pytorch / PiPPy
Pipeline Parallelism for PyTorch
☆780 Updated last year
mit-han-lab / llm-awq
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
☆3,318 Updated 3 months ago
huggingface / picotron
Minimalistic 4D-parallelism distributed training framework for education purpose
☆1,856 Updated last month
mirage-project / mirage
Mirage Persistent Kernel: Compiling LLMs into a MegaKernel
☆1,891 Updated this week
BobMcDear / attorch
A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton.
☆578 Updated 2 months ago
intel / neural-compressor
SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX R…
☆2,511 Updated this week
mit-han-lab / smoothquant
[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
☆1,525 Updated last year
jiaweizzhao / GaLore
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
☆1,610 Updated 11 months ago
IST-DASLab / marlin
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
☆916 Updated last year
pytorch / torchdynamo
A Python-level JIT compiler designed to make unmodified PyTorch programs faster.
☆1,064 Updated last year
facebookresearch / schedule_free
Schedule-Free Optimization in PyTorch
☆2,224 Updated 5 months ago
meta-pytorch / torchtune
PyTorch native post-training library
☆5,547 Updated this week
huggingface / optimum
🚀 Accelerate inference and training of 🤗 Transformers, Diffusers, TIMM and Sentence Transformers with easy to use hardware optimization…
☆3,121 Updated last week
microsoft / Samba
[ICLR 2025] Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling
☆915 Updated 5 months ago
casper-hansen / AutoAWQ
AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.
☆2,256 Updated 5 months ago
tile-ai / tilelang
Domain-specific language designed to streamline the development of high-performance GPU/CPU/Accelerators kernels
☆3,658 Updated this week
mit-han-lab / omniserve
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se…
☆766 Updated 7 months ago