codeplaysoftware / cutlass-sycl
A CUTLASS implementation using SYCL
☆31Updated this week
Alternatives and similar repositories for cutlass-sycl
Users who are interested in cutlass-sycl are comparing it to the libraries listed below
- ☆62Updated 7 months ago
- rocSHMEM intra-kernel networking runtime for AMD dGPUs on the ROCm platform.☆92Updated this week
- A tool for generating information about the matrix multiplication instructions in AMD Radeon™ and AMD Instinct™ accelerators☆107Updated last month
- An extension library of WMMA API (Tensor Core API)☆99Updated last year
- Assembler for NVIDIA Volta and Turing GPUs☆224Updated 3 years ago
- OpenAI Triton backend for Intel® GPUs☆193Updated this week
- ☆48Updated this week
- ☆104Updated last year
- Dissecting NVIDIA GPU Architecture☆101Updated 3 years ago
- Test suite for probing the numerical behavior of NVIDIA tensor cores☆40Updated 11 months ago
- rocWMMA☆119Updated this week
- ☆148Updated this week
- Intel® Extension for DeepSpeed* is an extension to DeepSpeed that brings feature support with SYCL kernels on Intel GPU(XPU) device. Note…☆61Updated 2 weeks ago
- Ahead of Time (AOT) Triton Math Library☆70Updated last week
- ☆25Updated 3 weeks ago
- A hierarchical collective communications library with portable optimizations☆35Updated 7 months ago
- Advanced Profiling and Analytics for AMD Hardware☆159Updated this week
- Provides the examples to write and build Habana custom kernels using the HabanaTools☆22Updated 3 months ago
- RCCL Performance Benchmark Tests☆70Updated this week
- A GPU benchmark suite for assessing on-chip GPU memory bandwidth☆106Updated 7 years ago
- An experimental CPU backend for Triton (https://github.com/openai/triton)☆43Updated 4 months ago
- ROC profiler library. Profiling with perf-counters and derived metrics.☆150Updated this week
- ☆51Updated 6 years ago
- oneCCL Bindings for Pytorch*☆99Updated last week
- GPU Performance Advisor☆65Updated 2 years ago
- Implementation of TSM2L and TSM2R -- High-Performance Tall-and-Skinny Matrix-Matrix Multiplication Algorithms for CUDA☆33Updated 4 years ago
- Fast GPU based tensor core reductions☆13Updated 2 years ago
- amdgpu example code in hip/asm☆35Updated last month
- Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs☆12Updated 3 months ago
- ☆20Updated 2 months ago