lcpu-club / awesome-rocm
Collections and tutorials for ROCm
☆22 Updated 11 months ago
Alternatives and similar repositories for awesome-rocm:
Users who are interested in awesome-rocm are comparing it to the repositories listed below.
- 🔥🔥🔥 A collection of some awesome public CUDA, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR and High Performance C… ☆181 Updated this week
- Advanced Matrix Extensions (AMX) Guide ☆79 Updated 3 years ago
- Magicube is a high-performance library for quantized sparse matrix operations (SpMM and SDDMM) of deep learning on Tensor Cores. ☆85 Updated 2 years ago
- Examples of CUDA implementations by Cutlass CuTe ☆132 Updated this week
- Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines. ☆52 Updated last year
- Microsoft Collective Communication Library ☆61 Updated 2 months ago
- performance engineering ☆27 Updated 6 months ago
- LLM Inference analyzer for different hardware platforms ☆47 Updated this week
- We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel … ☆175 Updated this week
- GEMM by WMMA (tensor core) ☆9 Updated 2 years ago
- ☆25 Updated 6 months ago
- ☆10 Updated 2 years ago
- oneAPI Collective Communications Library (oneCCL) ☆218 Updated last week
- collection of benchmarks to measure basic GPU capabilities ☆287 Updated 3 weeks ago
- ☆107 Updated 6 months ago
- rocSHMEM intra-kernel networking runtime for AMD dGPUs on the ROCm platform. ☆48 Updated this week
- ☆30 Updated 2 years ago
- ☆73 Updated 2 years ago
- ☆25 Updated 9 months ago
- An extension library of WMMA API (Tensor Core API) ☆87 Updated 6 months ago
- ☆210 Updated this week
- An experimental CPU backend for Triton ☆82 Updated last week
- Automatic Mapping Generation, Verification, and Exploration for ISA-based Spatial Accelerators ☆106 Updated 2 years ago
- Paella: Low-latency Model Serving with Virtualized GPU Scheduling ☆59 Updated 8 months ago
- CUDA PTX-ISA Document, Chinese translation ☆32 Updated last month
- Some source code about matrix multiplication implementation on CUDA ☆35 Updated 6 years ago
- ☆46 Updated 5 years ago
- ☆11 Updated 2 years ago
- ☆38 Updated 4 years ago
- PArallelLOOPgEneratoR: Threaded Loops Code Generation Infrastructure targeting Tensor Contraction Applications such as GEMMs, Convolution… ☆18 Updated last month