pkucnc / awesome-rocm
Collections and tutorials for ROCm
☆29 · Updated 6 months ago
Alternatives and similar repositories for awesome-rocm
Users who are interested in awesome-rocm are comparing it to the libraries listed below.
- This repository lists some awesome public CUDA, cuda-python, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR, PT… ☆419 · Updated 4 months ago
- A curated list of awesome matrix-matrix multiplication (A * B = C) frameworks, libraries and software ☆60 · Updated 9 months ago
- FlagTree is a unified compiler supporting multiple AI chip backends for custom Deep Learning operations, which is forked from triton-lang… ☆146 · Updated this week
- CUTLASS and CuTe Examples ☆112 · Updated 2 weeks ago
- rocSHMEM intra-kernel networking runtime for AMD dGPUs on the ROCm platform. ☆133 · Updated this week
- collection of benchmarks to measure basic GPU capabilities ☆474 · Updated last month
- Magicube is a high-performance library for quantized sparse matrix operations (SpMM and SDDMM) of deep learning on Tensor Cores. ☆90 · Updated 3 years ago
- An experimental CPU backend for Triton ☆167 · Updated last month
- Hands-On Practical MLIR Tutorial ☆46 · Updated 4 months ago
- LLM Inference analyzer for different hardware platforms ☆97 · Updated 2 weeks ago
- We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel … ☆190 · Updated 10 months ago
- ☆163 · Updated last year
- ☆274 · Updated last month
- FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑ vs SDPA EA. ☆239 · Updated last month
- DeepSeek-V3/R1 inference performance simulator ☆172 · Updated 8 months ago
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core) ☆146 · Updated 5 years ago
- CUDA Matrix Multiplication Optimization ☆245 · Updated last year
- An Easy-to-understand TensorOp Matmul Tutorial ☆395 · Updated 2 months ago
- Solution of Programming Massively Parallel Processors ☆50 · Updated last year
- Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators ☆499 · Updated this week
- Tile-based language built for AI computation across all scales ☆98 · Updated this week
- ⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance. ☆137 · Updated 7 months ago
- Assembler and Decompiler for NVIDIA (Maxwell Pascal Volta Turing Ampere) GPUs. ☆94 · Updated 2 years ago
- ☆156 · Updated 11 months ago
- A lightweight design for computation-communication overlap. ☆196 · Updated 2 months ago
- An extension library of WMMA API (Tensor Core API) ☆109 · Updated last year
- ☆32 · Updated last year
- AI Accelerator Benchmark focuses on evaluating AI Accelerators from a practical production perspective, including the ease of use and ver… ☆285 · Updated 4 months ago
- Yinghan's Code Sample ☆360 · Updated 3 years ago
- Dissecting NVIDIA GPU Architecture ☆115 · Updated 3 years ago