NVIDIA / TensorRT-Model-Optimizer

A unified library of state-of-the-art model optimization techniques such as quantization, pruning, distillation, and speculative decoding. It compresses deep learning models for downstream deployment frameworks such as TensorRT-LLM or TensorRT to optimize inference speed.

☆1,078

Alternatives and similar repositories for TensorRT-Model-Optimizer

Users who are interested in TensorRT-Model-Optimizer compare it to the libraries listed below.

NVIDIA / TransformerEngine
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper, Ada and Bla…
☆2,587 Updated this week
mit-han-lab / omniserve
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se…
☆730 Updated 4 months ago
mit-han-lab / smoothquant
[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
☆1,461 Updated last year
huggingface / optimum-quanto
A PyTorch quantization backend for Optimum
☆977 Updated 3 weeks ago
IST-DASLab / marlin
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
☆868 Updated 10 months ago
pytorch / ao
PyTorch native quantization and sparsity for training and inference
☆2,219 Updated this week
triton-inference-server / tensorrtllm_backend
The Triton TensorRT-LLM Backend
☆870 Updated last week
efeslab / Nanoflow
A throughput-oriented high-performance serving framework for LLMs
☆856 Updated 3 weeks ago
triton-inference-server / tutorials
This repository contains tutorials and examples for Triton Inference Server
☆742 Updated last week
microsoft / BitBLAS
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
☆653 Updated 3 weeks ago
ModelTC / LightCompress
[EMNLP 2024 Industry Track] This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a V…
☆522 Updated this week
FlagOpen / FlagGems
FlagGems is an operator library for large language models implemented in the Triton Language.
☆635 Updated this week
flashinfer-ai / flashinfer
FlashInfer: Kernel Library for LLM Serving
☆3,448 Updated this week
vllm-project / llm-compressor
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
☆1,690 Updated last week
intel / auto-round
Advanced Quantization Algorithm for LLMs and VLMs, with support for CPU, Intel GPU, CUDA and HPU. Seamlessly integrated with Torchao, Tra…
☆551 Updated this week
ByteDance-Seed / Triton-distributed
Distributed Compiler based on Triton for Parallel Systems
☆930 Updated this week
tspeterkim / flash-attention-minimal
Flash Attention in ~100 lines of CUDA (forward pass only)
☆887 Updated 7 months ago
pytorch / PiPPy
Pipeline Parallelism for PyTorch
☆775 Updated 11 months ago
intel / neural-compressor
SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX R…
☆2,461 Updated last week
NVIDIA / cudnn-frontend
cudnn_frontend provides a C++ wrapper for the cuDNN backend API and samples showing how to use it
☆596 Updated 2 weeks ago
Azure / MS-AMP
Microsoft Automatic Mixed Precision Library
☆616 Updated 10 months ago
onnx / optimizer
ONNX Optimizer
☆735 Updated 2 weeks ago
tile-ai / tilelang
Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels
☆1,472 Updated this week
spcl / QuaRot
Code for the NeurIPS 2024 paper QuaRot: end-to-end 4-bit inference of large language models.
☆410 Updated 8 months ago
mirage-project / mirage
Mirage: Automatically Generating Fast GPU Kernels without Programming in Triton/CUDA
☆1,629 Updated this week
ThanatosShinji / onnx-tool
A parser, editor and profiler tool for ONNX models.
☆446 Updated last month
Bruce-Lee-LY / cuda_hgemm
Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruct…
☆445 Updated 10 months ago
thuml / depyf
depyf is a tool to help you understand and adapt to the PyTorch compiler torch.compile.
☆706 Updated 3 months ago
Tencent / TPAT
TensorRT Plugin Autogen Tool
☆369 Updated 2 years ago
ModelCloud / GPTQModel
Production-ready LLM compression/quantization toolkit with hardware-accelerated inference support for both CPU and GPU via HF, vLLM, and SGLa…
☆702 Updated 2 weeks ago