facebookincubator / AITemplate
View external linksLinks

AITemplate is a Python framework which renders neural network into high performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.

☆4,704

Alternatives and similar repositories for AITemplate

Users that are interested in AITemplate are comparing it to the libraries listed below

Sorting:

triton-lang / triton
View on GitHub
Development repository for the Triton language and compiler
☆18,387Updated this week
NVIDIA / FasterTransformer
View on GitHub
Transformer related optimization, including BERT, GPT
☆6,392Mar 27, 2024Updated last year
ELS-RD / kernl
View on GitHub
Kernl lets you run PyTorch transformer models several times faster on GPU with a single line of code, and is designed to be easily hackab…
☆1,586Jan 28, 2026Updated 2 weeks ago
facebookresearch / xformers
View on GitHub
Hackable and optimized Transformers building blocks, supporting a composable construction.
☆10,336Feb 5, 2026Updated last week
NVIDIA / TransformerEngine
View on GitHub
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit floating point (FP8 and FP4) precision on H…
☆3,152Feb 7, 2026Updated last week
Dao-AILab / flash-attention
View on GitHub
Fast and memory-efficient exact attention
☆22,231Updated this week
FMInference / FlexLLMGen
View on GitHub
Running large language models on a single GPU for throughput-oriented scenarios.
☆9,384Oct 28, 2024Updated last year
pytorch / torchdynamo
View on GitHub
A Python-level JIT compiler designed to make unmodified PyTorch programs faster.
☆1,072Apr 17, 2024Updated last year
pytorch / TensorRT
View on GitHub
PyTorch/TorchScript/FX compiler for NVIDIA GPUs using TensorRT
☆2,944Updated this week
deepspeedai / DeepSpeed
View on GitHub
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
☆41,578Feb 7, 2026Updated last week
NVIDIA / cutlass
View on GitHub
CUDA Templates and Python DSLs for High-Performance Linear Algebra
☆9,266Updated this week
bitsandbytes-foundation / bitsandbytes
View on GitHub
Accessible large language models via k-bit quantization for PyTorch.
☆7,952Updated this week
apache / tvm
View on GitHub
Open Machine Learning Compiler Framework
☆13,117Updated this week
NVIDIA / TensorRT
View on GitHub
NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source compone…
☆12,694Updated this week
deepspeedai / DeepSpeed-MII
View on GitHub
MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.
☆2,093Jun 30, 2025Updated 7 months ago
microsoft / torchscale
View on GitHub
Foundation Architecture for (M)LLMs
☆3,130Apr 11, 2024Updated last year
triton-inference-server / server
View on GitHub
The Triton Inference Server provides an optimized cloud and edge inferencing solution.
☆10,361Updated this week
microsoft / nnfusion
View on GitHub
A flexible and efficient deep neural network (DNN) compiler that generates high-performance executable from a DNN model description.
☆1,006Sep 19, 2024Updated last year
flexflow / flexflow-train
View on GitHub
Automatically Discovering Fast Parallelization Strategies for Distributed Deep Neural Network Training
☆1,859Feb 7, 2026Updated last week
ggml-org / ggml
View on GitHub
Tensor library for machine learning
☆13,946Updated this week
huggingface / accelerate
View on GitHub
🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (i…
☆9,491Feb 6, 2026Updated last week
facebookresearch / fairscale
View on GitHub
PyTorch extensions for high performance and large scale training.
☆3,397Apr 26, 2025Updated 9 months ago
flashinfer-ai / flashinfer
View on GitHub
FlashInfer: Kernel Library for LLM Serving
☆4,935Updated this week
NVIDIA / Megatron-LM
View on GitHub
Ongoing research training transformer models at scale
☆15,162Updated this week
mlc-ai / mlc-llm
View on GitHub
Universal LLM Deployment Engine with ML Compilation
☆22,039Updated this week
alpa-projects / alpa
View on GitHub
Training and serving large-scale neural networks with auto parallelization.
☆3,180Dec 9, 2023Updated 2 years ago
hpcaitech / ColossalAI
View on GitHub
Making large AI models cheaper, faster and more accessible
☆41,346Jan 19, 2026Updated 3 weeks ago
CVCUDA / CV-CUDA
View on GitHub
CV-CUDA™ is an open-source, GPU accelerated library for cloud-scale image processing and computer vision.
☆2,650Jan 22, 2026Updated 3 weeks ago
NVIDIA / TensorRT-LLM
View on GitHub
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizat…
☆12,867Updated this week
huggingface / diffusers
View on GitHub
🤗 Diffusers: State-of-the-art diffusion models for image, video, and audio generation in PyTorch.
☆32,768Updated this week
stochasticai / x-stable-diffusion
View on GitHub
Real-time inference for Stable Diffusion - 0.88s latency. Covers AITemplate, nvFuser, TensorRT, FlashAttention. Join our Discord communty…
☆561Dec 4, 2023Updated 2 years ago
chengzeyi / stable-fast
View on GitHub
https://wavespeed.ai/ Best inference performance optimization framework for HuggingFace Diffusers on NVIDIA GPUs.
☆1,299Mar 27, 2025Updated 10 months ago
jax-ml / jax
View on GitHub
Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more
☆34,848Updated this week
huggingface / text-generation-inference
View on GitHub
Large Language Model Text Generation Inference
☆10,757Jan 8, 2026Updated last month
NVIDIA / DALI
View on GitHub
A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep lear…
☆5,619Feb 8, 2026Updated last week
NVIDIA / apex
View on GitHub
A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
☆8,918Updated this week
HazyResearch / ThunderKittens
View on GitHub
Tile primitives for speedy kernels
☆3,139Updated this week
sgl-project / sglang
View on GitHub
SGLang is a high-performance serving framework for large language models and multimodal models.
☆23,439Updated this week
microsoft / unilm
View on GitHub
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
☆22,021Jan 23, 2026Updated 3 weeks ago

facebookincubator / AITemplateView external linksLinks

Alternatives and similar repositories for AITemplate

facebookincubator / AITemplate
View external linksLinks