lcpu-club / awesome-rocm

Collections and tutorials for ROCm

☆27

Alternatives and similar repositories for awesome-rocm

Users who are interested in awesome-rocm are comparing it to the repositories listed below.

coderonion / awesome-cuda-and-hpc
🚀🚀🚀 This repository lists some awesome public CUDA, cuda-python, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR, PT…
☆288 Updated 3 weeks ago
FlagTree / flagtree
FlagTree is a unified compiler for multiple AI chips, which is forked from triton-lang/triton.
☆53 Updated this week
zartbot / shallowsim
DeepSeek-V3/R1 inference performance simulator
☆149 Updated 3 months ago
guanrenyang / Programming-Massively-Parallel-Processors
Solution of Programming Massively Parallel Processors
☆48 Updated last year
ParCIS / Magicube
Magicube is a high-performance library for quantized sparse matrix operations (SpMM and SDDMM) of deep learning on Tensor Cores.
☆89 Updated 2 years ago
ROCm / rocSHMEM
rocSHMEM intra-kernel networking runtime for AMD dGPUs on the ROCm platform.
☆90 Updated this week
mikeroyal / AMX-Guide
Advanced Matrix Extensions (AMX) Guide
☆92 Updated 3 years ago
yuninxia / awesome-gemm
📚 A curated list of awesome matrix-matrix multiplication (A * B = C) frameworks, libraries and software
☆41 Updated 4 months ago
sunkx109 / GPUs-Specs
Summary of the Specs of Commonly Used GPUs for Training and Inference of LLM
☆46 Updated 3 months ago
FZJ-JSC / tutorial-multi-gpu
Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial
☆274 Updated 2 weeks ago
shenh10 / DeepSeek_Simulator
☆73 Updated 2 months ago
libxsmm / tpp-pytorch-extension
Intel® Tensor Processing Primitives extension for Pytorch*
☆17 Updated last week
ROCm / rocMLIR
☆148 Updated this week
TiledTensor / TiledCUDA
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …
☆183 Updated 5 months ago
Azure / msccl
Microsoft Collective Communication Library
☆64 Updated 7 months ago
infinigence / FlashOverlap
A lightweight design for computation-communication overlap.
☆143 Updated last week
sunlex0717 / DissectingTensorCores
☆98 Updated last year
sjfeng1999 / gpu-arch-microbenchmark
Dissecting NVIDIA GPU Architecture
☆97 Updated 2 years ago
SJTU-ReArch-Group / Paper-Reading-List
☆110 Updated 3 weeks ago
leimao / CUTLASS-Examples
CUTLASS and CuTe Examples
☆57 Updated 5 months ago
pku-liang / AMOS
Automatic Mapping Generation, Verification, and Exploration for ISA-based Spatial Accelerators
☆112 Updated 2 years ago
heheda12345 / MagPy
☆39 Updated last year
Guangxuan-Xiao / SPMM-CUDA
☆12 Updated 3 years ago
abhibambhaniya / GenZ-LLM-Analyzer
LLM Inference analyzer for different hardware platforms
☆74 Updated last month
apache / tvm-rfcs
A home for the final text of all TVM RFCs.
☆105 Updated 9 months ago
xlite-dev / ffpa-attn
⚡️FFPA: Extend FlashAttention-2 with Split-D, achieve ~O(1) SRAM complexity for large headdim, 1.8x~3x↑ vs SDPA.
☆186 Updated last month
ROCm / hipBLASLt
[DEPRECATED] Moved to ROCm/rocm-libraries repo
☆106 Updated this week
QianyanTech / NBAssembler
Assembler and Decompiler for NVIDIA (Maxwell Pascal Volta Turing Ampere) GPUs.
☆80 Updated 2 years ago
PAA-NCIC / PE
performance engineering
☆30 Updated 11 months ago
mlc-ai / mlc-python
☆35 Updated last month