ROCm / AITemplateLinks

AITemplate is a Python framework which renders neural network into high performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.

☆11

Alternatives and similar repositories for AITemplate

Users that are interested in AITemplate are comparing it to the libraries listed below

Sorting:

ROCm / triton
Development repository for the Triton language and compiler
☆125Updated this week
ROCm / flash-attention
Fast and memory-efficient exact attention
☆174Updated this week
ROCm / aotriton
Ahead of Time (AOT) Triton Math Library
☆67Updated last week
ROCm / aiter
AI Tensor Engine for ROCm
☆208Updated this week
ROCm / vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
☆84Updated this week
intel / intel-xpu-backend-for-triton
OpenAI Triton backend for Intel® GPUs
☆191Updated this week
ROCm / TransformerEngine
☆38Updated this week
Repeerc / flash-attention-v2-RDNA3-minimal
a simple Flash Attention v2 implementation with ROCM (RDNA3 GPU, roc wmma), mainly used for stable diffusion(ComfyUI) in Windows ZLUDA en…
☆43Updated 10 months ago
ROCm / hipBLASLt
[DEPRECATED] Moved to ROCm/rocm-libraries repo
☆106Updated this week
ROCm / rocmProfileData
☆25Updated last week
ColfaxResearch / cutlass-kernels
☆212Updated 11 months ago
ROCm / rocWMMA
rocWMMA
☆117Updated this week
intel / torch-ccl
oneCCL Bindings for Pytorch*
☆97Updated 2 months ago
ColfaxResearch / cfx-article-src
☆117Updated last month
wangsiping97 / FastGEMV
High-speed GEMV kernels, at most 2.7x speedup compared to pytorch baseline.
☆109Updated 11 months ago
intel / torch-xpu-ops
☆48Updated this week
usyd-fsalab / fp6_llm
An efficient GPU support for LLM inference with x-bit quantization (e.g. FP6,FP5).
☆252Updated 8 months ago
ROCm / apex
A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
☆22Updated this week
intel / xetla
☆62Updated 6 months ago
north-numerical-computing / tensor-cores-numerical-behavior
Test suite for probing the numerical behavior of NVIDIA tensor cores
☆40Updated 11 months ago
codeplaysoftware / cutlass-sycl
A CUTLASS implementation using SYCL
☆27Updated this week
intel / intel-extension-for-deepspeed
Intel® Extension for DeepSpeed* is an extension to DeepSpeed that brings feature support with SYCL kernels on Intel GPU(XPU) device. Note…
☆61Updated last week
facebookexperimental / triton
Github mirror of trition-lang/triton repo.
☆39Updated last week
wmmae / wmma_extension
An extension library of WMMA API (Tensor Core API)
☆99Updated 11 months ago
ROCm / rccl
ROCm Communication Collectives Library (RCCL)
☆343Updated this week
RRZE-HPC / gpu-benches
collection of benchmarks to measure basic GPU capabilities
☆385Updated 4 months ago
Dao-AILab / quack
A Quirky Assortment of CuTe Kernels
☆117Updated this week
ROCm / amd_matrix_instruction_calculator
A tool for generating information about the matrix multiplication instructions in AMD Radeon™ and AMD Instinct™ accelerators
☆102Updated last month
ROCm / bitsandbytes
8-bit CUDA functions for PyTorch
☆53Updated last week
ROCm / rocprofiler
ROC profiler library. Profiling with perf-counters and derived metrics.
☆148Updated last week