north-numerical-computing / tensor-cores-numerical-behavior
Test suite for probing the numerical behavior of NVIDIA tensor cores
☆38Updated 9 months ago
Alternatives and similar repositories for tensor-cores-numerical-behavior:
Users who are interested in tensor-cores-numerical-behavior are comparing it to the libraries listed below:
- ☆96Updated last year
- An extension library of WMMA API (Tensor Core API)☆96Updated 9 months ago
- Dissecting NVIDIA GPU Architecture☆92Updated 2 years ago
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core)☆131Updated 4 years ago
- ☆30Updated this week
- ☆18Updated 5 years ago
- Ahead of Time (AOT) Triton Math Library☆62Updated 2 weeks ago
- ☆44Updated 4 years ago
- A tool for generating information about the matrix multiplication instructions in AMD Radeon™ and AMD Instinct™ accelerators☆89Updated last month
- A Winograd Minimal Filter Implementation in CUDA☆24Updated 3 years ago
- An experimental CPU backend for Triton (https://github.com/openai/triton)☆40Updated last month
- Artifacts of EVT ASPLOS'24☆24Updated last year
- High-speed GEMV kernels, with up to 2.7x speedup compared to the PyTorch baseline.☆106Updated 9 months ago
- Magicube is a high-performance library for quantized sparse matrix operations (SpMM and SDDMM) of deep learning on Tensor Cores.☆87Updated 2 years ago
- ☆50Updated last year
- rocWMMA☆110Updated this week
- ☆38Updated 5 years ago
- A CUTLASS implementation using SYCL☆20Updated this week
- PyTorch extension for emulating FP8 data formats on standard FP32 Xeon/GPU hardware.☆109Updated 5 months ago
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.☆84Updated this week
- SparseTIR: Sparse Tensor Compiler for Deep Learning☆135Updated 2 years ago
- ☆142Updated this week
- A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores☆51Updated last year
- Fast GPU based tensor core reductions☆13Updated 2 years ago
- ☆51Updated 5 years ago
- ☆78Updated 6 months ago
- ☆104Updated last month
- RCCL Performance Benchmark Tests☆64Updated last week
- CUDA Matrix Multiplication Optimization☆184Updated 9 months ago
- An extension of TVMScript for writing simple, high-performance GPU kernels with Tensor Cores.☆50Updated 9 months ago