ROCm / composable_kernel
Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators
☆395 · Updated this week
Alternatives and similar repositories for composable_kernel
Users who are interested in composable_kernel are comparing it to the libraries listed below.
- collection of benchmarks to measure basic GPU capabilities ☆370 · Updated 3 months ago
- Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruct… ☆405 · Updated 8 months ago
- Assembler for NVIDIA Volta and Turing GPUs ☆218 · Updated 3 years ago
- OpenAI Triton backend for Intel® GPUs ☆184 · Updated this week
- An unofficial cuda assembler, for all generations of SASS, hopefully :) ☆496 · Updated 2 years ago
- ROCm Communication Collectives Library (RCCL) ☆332 · Updated this week
- Shared Middle-Layer for Triton Compilation ☆246 · Updated this week
- Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance. ☆345 · Updated 4 months ago
- CUDA Kernel Benchmarking Library ☆631 · Updated this week
- Experimental projects related to TensorRT ☆99 · Updated this week
- AMD's graph optimization engine. ☆217 · Updated this week
- Stretching GPU performance for GEMMs and tensor contractions. ☆237 · Updated this week
- A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser") ☆324 · Updated this week
- Development repository for the Triton language and compiler ☆120 · Updated this week
- ☆202 · Updated 10 months ago
- An Easy-to-understand TensorOp Matmul Tutorial ☆346 · Updated 7 months ago
- AI Tensor Engine for ROCm ☆190 · Updated this week
- CUDA Matrix Multiplication Optimization ☆186 · Updated 9 months ago
- Yinghan's Code Sample ☆327 · Updated 2 years ago
- Step-by-step optimization of CUDA SGEMM ☆315 · Updated 3 years ago
- MSCCL++: A GPU-driven communication stack for scalable AI applications ☆349 · Updated this week
- A simple high performance CUDA GEMM implementation. ☆366 · Updated last year
- A tool for generating information about the matrix multiplication instructions in AMD Radeon™ and AMD Instinct™ accelerators ☆92 · Updated last month
- Next generation BLAS implementation for ROCm platform ☆368 · Updated this week
- cudnn_frontend provides a c++ wrapper for the cudnn backend API and samples on how to use it ☆554 · Updated last month
- A tool for bandwidth measurements on NVIDIA GPUs. ☆420 · Updated 3 weeks ago
- rocWMMA ☆110 · Updated last week
- amdgpu example code in hip/asm ☆31 · Updated 3 weeks ago
- Training material for Nsight developer tools ☆157 · Updated 9 months ago
- Development repository for the Triton-Linalg conversion ☆185 · Updated 3 months ago