wmmae / wmma_extension
An extension library for the WMMA API (Tensor Core API)
☆87Updated 7 months ago
Alternatives and similar repositories for wmma_extension:
Users who are interested in wmma_extension are comparing it to the libraries listed below:
- Dissecting NVIDIA GPU Architecture☆88Updated 2 years ago
- ☆87Updated 9 months ago
- ☆137Updated this week
- ☆38Updated 4 years ago
- MatMul Performance Benchmarks for a Single CPU Core comparing both hand-engineered and codegen kernels.☆127Updated last year
- A tool for generating information about the matrix multiplication instructions in AMD Radeon™ and AMD Instinct™ accelerators☆73Updated last year
- ☆69Updated last month
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core)☆124Updated 4 years ago
- We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …☆175Updated 2 weeks ago
- Assembler for NVIDIA Volta and Turing GPUs☆211Updated 3 years ago
- Implementation of TSM2L and TSM2R -- High-Performance Tall-and-Skinny Matrix-Matrix Multiplication Algorithms for CUDA☆32Updated 4 years ago
- ☆40Updated 4 years ago
- ☆60Updated last month
- A GPU benchmark suite for assessing on-chip GPU memory bandwidth☆104Updated 7 years ago
- play gemm with tvm☆86Updated last year
- Unified compiler/runtime for interfacing with PyTorch Dynamo.☆100Updated this week
- development repository for the open earth compiler☆79Updated 3 years ago
- Magicube is a high-performance library for quantized sparse matrix operations (SpMM and SDDMM) of deep learning on Tensor Cores.☆85Updated 2 years ago
- ☆47Updated 5 years ago
- ☆42Updated 4 years ago
- Test suite for probing the numerical behavior of NVIDIA tensor cores☆37Updated 6 months ago
- GPU Performance Advisor☆64Updated 2 years ago
- A language and compiler for irregular tensor programs.☆135Updated 2 months ago
- rocWMMA☆100Updated this week
- collection of benchmarks to measure basic GPU capabilities☆290Updated this week
- Standalone Flash Attention v2 kernel without libtorch dependency☆103Updated 5 months ago
- ⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.☆51Updated last week
- CUDA Matrix Multiplication Optimization☆159Updated 6 months ago
- amdgpu example code in hip/asm☆26Updated this week
- ☆180Updated 7 months ago