seb-v / fp32_sgemm_amd
Super fast FP32 matrix multiplication on RDNA3
☆46Updated 3 weeks ago
Alternatives and similar repositories for fp32_sgemm_amd:
Users that are interested in fp32_sgemm_amd are comparing it to the libraries listed below
- High-Performance SGEMM on CUDA devices☆90Updated 3 months ago
- A simple profiler to count Nvidia PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis.☆50Updated last month
- GPUOcelot: A dynamic compilation framework for PTX☆187Updated 2 months ago
- A tool for generating information about the matrix multiplication instructions in AMD Radeon™ and AMD Instinct™ accelerators☆89Updated 2 weeks ago
- Tenstorrent MLIR compiler☆120Updated this week
- amdgpu example code in hip/asm☆31Updated last week
- Utilities for accessing AMD's Machine-Readable GPU ISA Specifications.☆32Updated last month
- MLIR-based partitioning system☆80Updated this week
- Nvidia Instruction Set Specification Generator☆256Updated 9 months ago
- An experimental CPU backend for Triton (https://github.com/openai/triton)☆40Updated last month
- A GLSL compiler targeting SPIR-V MLIR☆19Updated 6 months ago
- rocWMMA☆109Updated this week
- AMD lab notes with code examples to demonstrate use of AMD GPUs☆97Updated 9 months ago
- Intel® Extension for MLIR. A staging ground for MLIR dialects and tools for Intel devices using the MLIR toolchain.☆134Updated last week
- Unofficial description of the CUDA assembly (SASS) instruction sets.☆90Updated last month
- A framework that supports executing unmodified CUDA source code on non-NVIDIA devices.☆121Updated 3 months ago
- ☆141Updated this week
- ☆54Updated 10 months ago
- The HIP Environment and ROCm Kit - A lightweight open source build system for HIP and ROCm☆49Updated this week
- The missing pieces (as far as boilerplate reduction goes) of the upstream MLIR python bindings.☆89Updated this week
- MLIR metal dialect☆26Updated 7 months ago
- This is the AMD-maintained fork of the LLVM git repository. This repository accepts pull requests and issues related to AMD fork-specific…☆143Updated this week
- Exploring the scalable matrix extension of the Apple M4 processor☆171Updated 5 months ago
- Reference Kernels for the Leaderboard☆33Updated last week
- The TT-Forge FE is a graph compiler designed to optimize and transform computational graphs for deep learning models, enhancing their per…☆40Updated this week
- A profiler to disclose and quantify hardware features on GPUs.☆168Updated 2 years ago
- The translator that supports translating NVPTX to SPIR-V. This translator is modified from LLVM-SPIR-V Translator.☆38Updated 3 years ago
- CudaPAD is a PTX/SASS viewer for NVIDIA CUDA kernels that provides an on-the-fly view of the assembly.☆117Updated 2 years ago
- Repo for AI Compiler team. The intended purpose of this repo is for implementation of a PJRT device.☆15Updated this week
- TPP experimentation on MLIR for linear algebra☆127Updated this week