ROCm / composable_kernel
Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators
☆423 Updated this week
Alternatives and similar repositories for composable_kernel
Users who are interested in composable_kernel are comparing it to the libraries listed below.
- collection of benchmarks to measure basic GPU capabilities ☆385 Updated 4 months ago
- Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruct… ☆425 Updated 9 months ago
- CUDA Kernel Benchmarking Library ☆669 Updated this week
- Shared Middle-Layer for Triton Compilation ☆255 Updated this week
- OpenAI Triton backend for Intel® GPUs ☆191 Updated this week
- Assembler for NVIDIA Volta and Turing GPUs ☆222 Updated 3 years ago
- A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser") ☆337 Updated this week
- ROCm Communication Collectives Library (RCCL) ☆342 Updated this week
- AI Tensor Engine for ROCm ☆208 Updated this week
- An unofficial cuda assembler, for all generations of SASS, hopefully :) ☆508 Updated 2 years ago
- Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance. ☆357 Updated 5 months ago
- [DEPRECATED] Moved to ROCm/rocm-libraries repo ☆245 Updated this week
- Development repository for the Triton language and compiler ☆125 Updated this week
- CUDA Matrix Multiplication Optimization ☆196 Updated 11 months ago
- An Easy-to-understand TensorOp Matmul Tutorial ☆364 Updated 9 months ago
- A tool for generating information about the matrix multiplication instructions in AMD Radeon™ and AMD Instinct™ accelerators ☆98 Updated last month
- ☆212 Updated 11 months ago
- Step-by-step optimization of CUDA SGEMM ☆339 Updated 3 years ago
- MSCCL++: A GPU-driven communication stack for scalable AI applications ☆379 Updated this week
- Experimental projects related to TensorRT ☆105 Updated last week
- PyTorch emulation library for Microscaling (MX)-compatible data formats ☆247 Updated last week
- Yinghan's Code Sample ☆332 Updated 2 years ago
- AMD's graph optimization engine. ☆223 Updated this week
- A simple high performance CUDA GEMM implementation. ☆380 Updated last year
- ☆148 Updated this week
- cudnn_frontend provides a c++ wrapper for the cudnn backend API and samples on how to use it ☆582 Updated last week
- The NVIDIA® Tools Extension SDK (NVTX) is a C-based Application Programming Interface (API) for annotating events, code ranges, and resou… ☆396 Updated 3 weeks ago
- A model compilation solution for various hardware ☆437 Updated this week
- AI Accelerator Benchmark focuses on evaluating AI Accelerators from a practical production perspective, including the ease of use and ver… ☆246 Updated last week
- ☆117 Updated last month