codeplaysoftware / cutlass-fork

CUDA Templates for Linear Algebra Subroutines

☆20

Alternatives and similar repositories for cutlass-fork:

Users who are interested in cutlass-fork are comparing it to the libraries listed below.

intel / xetla
☆61 Updated 3 months ago
wmmae / wmma_extension
An extension library of WMMA API (Tensor Core API)
☆95 Updated 9 months ago
ROCm / aotriton
Ahead of Time (AOT) Triton Math Library
☆57 Updated this week
ROCm / rccl-tests
RCCL Performance Benchmark Tests
☆63 Updated this week
ROCm / TransformerEngine
☆27 Updated this week
ROCm / rocSHMEM
rocSHMEM intra-kernel networking runtime for AMD dGPUs on the ROCm platform.
☆76 Updated this week
intel / torch-xpu-ops
☆38 Updated this week
sunlex0717 / DissectingTensorCores
☆94 Updated 11 months ago
daadaada / gas
☆43 Updated 4 years ago
pytorch-labs / triton-cpu
An experimental CPU backend for Triton (https://github.com/openai/triton)
☆40 Updated last month
ParCoreLab / Snoopie
Multi-GPU communication profiler and visualizer
☆28 Updated 10 months ago
merthidayetoglu / HiCCL
A hierarchical collective communications library with portable optimizations
☆33 Updated 4 months ago
oneapi-src / level-zero-spec
☆20 Updated 3 months ago
sjfeng1999 / gpu-arch-microbenchmark
Dissecting NVIDIA GPU Architecture
☆90 Updated 2 years ago
Jokeren / GPA
GPU Performance Advisor
☆64 Updated 2 years ago
ColfaxResearch / cfx-article-src
☆97 Updated last month
lixiuhong / batched_gemm
☆38 Updated 5 years ago
north-numerical-computing / tensor-cores-numerical-behavior
Test suite for probing the numerical behavior of NVIDIA tensor cores
☆37 Updated 8 months ago
ROCm / rocmProfileData
☆22 Updated last month
intel / intel-extension-for-deepspeed
Intel® Extension for DeepSpeed* is an extension to DeepSpeed that brings feature support with SYCL kernels on Intel GPU(XPU) device. Note…
☆62 Updated last month
ROCm / rocMLIR
☆141 Updated this week
ROCm / rocm_bandwidth_test
Bandwidth test for ROCm
☆54 Updated this week
zhisbug / Cavs
Cavs: An Efficient Runtime System for Dynamic Neural Networks
☆14 Updated 4 years ago
ROCm / roctracer
ROCm Tracer Callback/Activity Library for Performance tracing AMD GPUs
☆81 Updated this week
sandeepkumar-skb / pytorch_custom_op
End to End steps for adding custom ops in PyTorch.
☆21 Updated 4 years ago
microsoft / TileFusion
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
☆80 Updated last week
yifuwang / symm-mem-recipes
☆66 Updated 3 months ago
intel / llvm-test-suite
☆20 Updated 2 years ago
PAA-NCIC / PPoPP2017_artifact
Third party assembler and GEMM library for NVIDIA Kepler GPU
☆81 Updated 5 years ago
decodecudabinary / Decoding-CUDA-Binary
☆51 Updated 5 years ago