☆60 · Updated this week
Alternatives and similar repositories for TransformerEngine
Users who are interested in TransformerEngine are comparing it to the libraries listed below.
- ☆23 · Jul 11, 2025 · Updated 7 months ago
- ☆11 · Jun 29, 2021 · Updated 4 years ago
- MAD (Model Automation and Dashboarding) ☆31 · Feb 11, 2026 · Updated 2 weeks ago
- Ongoing research training transformer models at scale ☆37 · Feb 20, 2026 · Updated last week
- Fast and memory-efficient exact attention ☆221 · Updated this week
- Persistent dense gemm for Hopper in `CuTeDSL` ☆15 · Aug 9, 2025 · Updated 6 months ago
- ☆30 · Feb 11, 2026 · Updated 2 weeks ago
- [DEPRECATED] Moved to ROCm/rocm-systems repo ☆144 · Updated this week
- Row-wise block scaling for fp8 quantization matrix multiplication. Solution to the GPU MODE AMD challenge. ☆17 · Feb 9, 2026 · Updated 2 weeks ago
- [DEPRECATED] Moved to ROCm/rocm-libraries repo. NOTE: develop branch is maintained as a read-only mirror ☆521 · Updated this week
- Ahead of Time (AOT) Triton Math Library ☆92 · Updated this week
- Benchmark tests supporting the TiledCUDA library. ☆18 · Nov 19, 2024 · Updated last year
- ☆17 · Nov 11, 2025 · Updated 3 months ago
- Tile-based language built for AI computation across all scales ☆123 · Feb 14, 2026 · Updated 2 weeks ago
- AI Tensor Engine for ROCm ☆356 · Feb 21, 2026 · Updated last week
- AMD’s C++ library for accelerating tensor primitives ☆49 · Feb 18, 2026 · Updated last week
- Hackable and optimized Transformers building blocks, supporting a composable construction. ☆34 · Updated this week
- AMD RAD's multi-GPU Triton-based framework for seamless multi-GPU programming ☆172 · Feb 21, 2026 · Updated last week
- TransferBench is a utility capable of benchmarking simultaneous copies between user-specified devices (CPUs/GPUs) ☆57 · Updated this week
- [DEPRECATED] Moved to ROCm/rocm-libraries repo ☆139 · Updated this week
- ☆38 · Aug 7, 2025 · Updated 6 months ago
- My submission for the GPUMODE/AMD fp8 mm challenge ☆29 · Jun 4, 2025 · Updated 8 months ago
- Primus-SaFE (Stability and Fault Endurance) ☆50 · Feb 21, 2026 · Updated last week
- [DEPRECATED] Moved to ROCm/rocm-systems repo ☆86 · Feb 11, 2026 · Updated 2 weeks ago
- ☆71 · Feb 21, 2026 · Updated last week
- ☆23 · Feb 17, 2026 · Updated last week
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. ☆106 · Jun 28, 2025 · Updated 8 months ago
- A PyTorch Extension: Tools for easy mixed precision and distributed training in PyTorch ☆25 · Updated this week
- ☆74 · Updated this week
- Development repository for the Triton language and compiler ☆140 · Updated this week
- ☆13 · Jan 7, 2025 · Updated last year
- CPU and GPU tutorial examples ☆13 · Apr 4, 2025 · Updated 10 months ago
- ☆71 · Mar 26, 2025 · Updated 11 months ago
- An extension library of WMMA API (Tensor Core API) ☆109 · Jul 12, 2024 · Updated last year
- [DEPRECATED] Moved to ROCm/rocm-libraries repo ☆113 · Feb 20, 2026 · Updated last week
- ☆15 · Oct 30, 2025 · Updated 3 months ago
- ☆12 · Dec 31, 2020 · Updated 5 years ago
- ☆24 · May 9, 2025 · Updated 9 months ago
- A PyTorch native platform for training generative AI models ☆15 · Nov 18, 2025 · Updated 3 months ago