CoffeeBeforeArch / mmul
Serial and parallel implementations of matrix multiplication
☆41Updated 4 years ago
Alternatives and similar repositories for mmul
Users who are interested in mmul are comparing it to the libraries listed below
- CUDA implementation of the fundamental sum reduce operation. Aims to be as optimized as reasonable.☆37Updated 7 years ago
- AMD’s C++ library for accelerating tensor primitives☆42Updated this week
- ☆23Updated 3 years ago
- [DEPRECATED] Moved to ROCm/rocm-libraries repo☆119Updated this week
- ☆29Updated 5 years ago
- Samples demonstrating how to use the Compute Sanitizer Tools and Public API☆83Updated last year
- An extension library of WMMA API (Tensor Core API)☆99Updated 11 months ago
- Next generation LAPACK implementation for ROCm platform☆103Updated this week
- A simple profiler to count Nvidia PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis.☆55Updated 3 months ago
- Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs.☆33Updated 2 months ago
- FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme☆73Updated 3 months ago
- ☆36Updated this week
- 🎃 GPU load-balancing library for regular and irregular computations.☆62Updated last year
- My notes on various HPC papers.☆22Updated 2 years ago
- Resources for the introductory GPU programming course of the HPC-AI specialized master's program☆21Updated last year
- ☆44Updated 4 years ago
- Examples from Programming in Parallel with CUDA☆153Updated 2 years ago
- Intermediate MPI lesson☆28Updated 2 years ago
- NVIDIA HPCG is based on the HPCG benchmark and optimized for performance on NVIDIA accelerated HPC systems.☆56Updated 2 months ago
- ☆20Updated 9 years ago
- Short examples illustrating AVX2 intrinsics for simple tasks.☆95Updated last year
- Generate simple index ranges in C++ and CUDA C++☆39Updated 2 years ago
- CUDA Matrix Multiplication Optimization☆196Updated 11 months ago
- rocWMMA☆115Updated last week
- Benchmark for measuring the performance of sparse and irregular memory access.☆78Updated last month
- NVIDIA tools guide☆135Updated 5 months ago
- Examples and exercises from the book Programming Massively Parallel Processors - A Hands-on Approach. David B. Kirk and Wen-mei W. Hwu (T…☆69Updated 4 years ago
- MPI accelerator-integrated communication extensions☆33Updated 2 years ago
- MatMul Performance Benchmarks for a Single CPU Core comparing both hand engineered and codegen kernels.☆133Updated last year
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core)☆137Updated 4 years ago