pkucnc / awesome-rocm
Collections and tutorials for ROCm
☆29 Updated 5 months ago
Alternatives and similar repositories for awesome-rocm
Users who are interested in awesome-rocm are comparing it to the repositories listed below.
- This repository lists some awesome public CUDA, cuda-python, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR, PT… ☆395 Updated 3 months ago
- A curated list of awesome matrix-matrix multiplication (A * B = C) frameworks, libraries and software ☆55 Updated 8 months ago
- Collection of benchmarks to measure basic GPU capabilities ☆436 Updated last week
- Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators ☆481 Updated this week
- rocSHMEM intra-kernel networking runtime for AMD dGPUs on the ROCm platform. ☆122 Updated this week
- Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial ☆314 Updated last week
- AI Accelerator Benchmark focuses on evaluating AI Accelerators from a practical production perspective, including the ease of use and ver… ☆265 Updated 2 months ago
- DeepSeek-V3/R1 inference performance simulator ☆170 Updated 7 months ago
- oneAPI Collective Communications Library (oneCCL) ☆244 Updated last week
- Advanced Matrix Extensions (AMX) Guide ☆105 Updated 3 years ago
- ☆261 Updated 2 weeks ago
- FlagTree is a unified compiler for multiple AI chips, which is forked from triton-lang/triton. ☆128 Updated this week
- A tool for bandwidth measurements on NVIDIA GPUs. ☆553 Updated 6 months ago
- Stepwise optimizations of DGEMM on CPU, eventually reaching performance faster than Intel MKL, even under multithreading. ☆154 Updated 3 years ago
- BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment. ☆705 Updated 2 months ago
- Summary of the Specs of Commonly Used GPUs for Training and Inference of LLM ☆63 Updated 2 months ago
- FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑ vs SDPA EA. ☆227 Updated 2 months ago
- RCCL Performance Benchmark Tests ☆76 Updated 2 weeks ago
- AI Tensor Engine for ROCm ☆292 Updated last week
- ROCm Communication Collectives Library (RCCL) ☆397 Updated this week
- An experimental CPU backend for Triton ☆155 Updated last week
- Microsoft Collective Communication Library ☆67 Updated 11 months ago
- CUDA Matrix Multiplication Optimization ☆234 Updated last year
- CUTLASS and CuTe Examples ☆93 Updated 2 weeks ago
- ☆122 Updated this week
- ☆90 Updated 7 months ago
- SYCL* Templates for Linear Algebra (SYCL*TLA) - SYCL based CUTLASS implementation for Intel GPUs ☆44 Updated this week
- LLM Inference analyzer for different hardware platforms ☆94 Updated 3 months ago
- Shared Middle-Layer for Triton Compilation ☆298 Updated last week
- NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process com… ☆358 Updated 2 weeks ago