CoffeeBeforeArch / mmul
Serial and parallel implementations of matrix multiplication
☆35 Updated 3 years ago
Related projects
Alternatives and complementary repositories for mmul
- "Hardware, Software, and Compilers! Oh My!" tutorial files ☆17 Updated 4 years ago
- ROCm Thrust - run Thrust dependent software on AMD GPUs ☆100 Updated this week
- MagmaDNN: a simple deep learning framework in C++ ☆45 Updated 4 years ago
- A simple profiler to count Nvidia PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis. ☆43 Updated 10 months ago
- Generate simple index ranges in C++ and CUDA C++ ☆39 Updated last year
- NVIDIA HPCG is based on the HPCG benchmark and optimized for performance on NVIDIA accelerated HPC systems. ☆44 Updated last month
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core) ☆116 Updated 4 years ago
- ☆41 Updated 4 years ago
- Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs. ☆29 Updated 2 months ago
- Learn OpenMP examples step by step ☆86 Updated 3 years ago
- AMD lab notes with code examples to demonstrate use of AMD GPUs ☆91 Updated 4 months ago
- Examples from Programming in Parallel with CUDA ☆108 Updated last year
- 🎃 GPU load-balancing library for regular and irregular computations. ☆57 Updated 5 months ago
- Next generation LAPACK implementation for ROCm platform ☆94 Updated this week
- AMD’s C++ library for accelerating tensor primitives ☆35 Updated this week
- Next generation SPARSE implementation for ROCm platform ☆117 Updated this week
- Samples demonstrating how to use the Compute Sanitizer Tools and Public API ☆68 Updated last year
- THIS REPOSITORY HAS MOVED TO github.com/nvidia/cub, WHICH IS AUTOMATICALLY MIRRORED HERE. ☆83 Updated 9 months ago
- Sample examples of how to call collective operation functions on multi-GPU environments. A simple example of using broadcast, reduce, all… ☆25 Updated last year
- Advanced Profiling and Analytics for AMD Hardware ☆137 Updated this week
- RAJA Performance Suite ☆110 Updated this week
- ☆30 Updated this week
- Short examples illustrating AVX2 intrinsics for simple tasks. ☆83 Updated 8 months ago
- An extension library of WMMA API (Tensor Core API) ☆84 Updated 4 months ago
- ☆82 Updated last year
- Template for starting CUDA/C++ project using CMake with Github Action for CI ☆29 Updated last year
- GPUOcelot: A dynamic compilation framework for PTX ☆147 Updated last month
- A GPU benchmark suite for assessing on-chip GPU memory bandwidth ☆99 Updated 7 years ago
- Resources for the introductory GPU programming course of the HPC-AI specialized master's program ☆22 Updated 10 months ago
- Instructions, Docker images, and examples for Nsight Compute and Nsight Systems ☆128 Updated 4 years ago