NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper, Ada and Blackwell GPUs, to provide better performance with lower memory utilization in both training and inference.

☆2,763

Alternatives and similar repositories for TransformerEngine

Users interested in TransformerEngine are comparing it to the libraries listed below.

flashinfer-ai / flashinfer
FlashInfer: Kernel Library for LLM Serving
☆3,861 Updated this week
pytorch / ao
PyTorch native quantization and sparsity for training and inference
☆2,392 Updated last week
NVIDIA / FasterTransformer
Transformer related optimization, including BERT, GPT
☆6,320 Updated last year
NVIDIA / TensorRT-Model-Optimizer
A unified library of state-of-the-art model optimization techniques like quantization, pruning, distillation, speculative decoding, etc. …
☆1,431 Updated last week
HazyResearch / ThunderKittens
Tile primitives for speedy kernels
☆2,803 Updated this week
mit-han-lab / smoothquant
[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
☆1,516 Updated last year
mit-han-lab / llm-awq
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
☆3,289 Updated 2 months ago
deepspeedai / Megatron-DeepSpeed
Ongoing research training transformer language models at scale, including: BERT & GPT-2
☆2,166 Updated 2 months ago
flexflow / flexflow-train
Automatically Discovering Fast Parallelization Strategies for Distributed Deep Neural Network Training
☆1,836 Updated last week
pytorch / PiPPy
Pipeline Parallelism for PyTorch
☆780 Updated last year
facebookresearch / fairscale
PyTorch extensions for high performance and large scale training.
☆3,380 Updated 5 months ago
deepspeedai / DeepSpeed-MII
MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.
☆2,063 Updated 3 months ago
intel / neural-compressor
SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX R…
☆2,508 Updated this week
IST-DASLab / gptq
Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".
☆2,196 Updated last year
triton-inference-server / tensorrtllm_backend
The Triton TensorRT-LLM Backend
☆897 Updated this week
mirage-project / mirage
Mirage Persistent Kernel: Compiling LLMs into a MegaKernel
☆1,877 Updated this week
microsoft / Tutel
Tutel MoE: Optimized Mixture-of-Experts Library, Support GptOss/DeepSeek/Kimi-K2/Qwen3 using FP8/NVFP4/MXFP4
☆928 Updated last week
ByteDance-Seed / Triton-distributed
Distributed Compiler based on Triton for Parallel Systems
☆1,152 Updated last week
huggingface / nanotron
Minimalistic large language model 3D-parallelism training
☆2,252 Updated last month
huggingface / optimum-quanto
A pytorch quantization backend for optimum
☆991 Updated last month
NVIDIA / nccl-tests
NCCL Tests
☆1,284 Updated last week
tile-ai / tilelang
Domain-specific language designed to streamline the development of high-performance GPU/CPU/Accelerators kernels
☆3,508 Updated this week
pytorch / torchdynamo
A Python-level JIT compiler designed to make unmodified PyTorch programs faster.
☆1,062 Updated last year
srush / Triton-Puzzles
Puzzles for learning Triton
☆2,031 Updated 10 months ago
bytedance / flux
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
☆1,142 Updated last month
pytorch / kineto
A CPU+GPU Profiling library that provides access to timeline traces and hardware performance counters.
☆879 Updated last week
pytorch / FBGEMM
FB (Facebook) + GEMM (General Matrix-Matrix Multiplication) - https://code.fb.com/ml-applications/fbgemm/
☆1,449 Updated this week
tspeterkim / flash-attention-minimal
Flash Attention in ~100 lines of CUDA (forward pass only)
☆941 Updated 9 months ago
ELS-RD / kernl
Kernl lets you run PyTorch transformer models several times faster on GPU with a single line of code, and is designed to be easily hackab…
☆1,584 Updated last year
volcengine / veScale
A PyTorch Native LLM Training Framework
☆874 Updated last month