CoffeeBeforeArch / mmul
Serial and parallel implementations of matrix multiplication
☆40 · Updated 4 years ago
Alternatives and similar repositories for mmul:
Users who are interested in mmul are comparing it to the repositories listed below.
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core) ☆125 · Updated 4 years ago
- A simple profiler to count Nvidia PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis. ☆50 · Updated last year
- Generate simple index ranges in C++ and CUDA C++ ☆39 · Updated last year
- ROCm Thrust - run Thrust dependent software on AMD GPUs ☆106 · Updated this week
- Next generation LAPACK implementation for ROCm platform ☆99 · Updated this week
- NVIDIA HPCG is based on the HPCG benchmark and optimized for performance on NVIDIA accelerated HPC systems. ☆51 · Updated 2 weeks ago
- AMD’s C++ library for accelerating tensor primitives ☆38 · Updated this week
- ☆43 · Updated 4 years ago
- Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs. ☆30 · Updated 3 months ago
- ☆29 · Updated 5 years ago
- ☆37 · Updated 3 years ago
- CUDA implementation of the fundamental sum reduce operation. Aims to be as optimized as reasonable. ☆36 · Updated 7 years ago
- Next generation SPARSE implementation for ROCm platform ☆119 · Updated this week
- MPI accelerator-integrated communication extensions ☆32 · Updated last year
- Learn OpenMP examples step by step ☆90 · Updated last month
- An extension library of WMMA API (Tensor Core API) ☆91 · Updated 8 months ago
- NVIDIA tools guide ☆112 · Updated 2 months ago
- Resources for the introduction to GPU programming course of the HPC-AI specialized master's program ☆22 · Updated last year
- ☆16 · Updated 3 years ago
- Short examples illustrating AVX2 intrinsics for simple tasks. ☆87 · Updated last year
- The Combinatorial BLAS (CombBLAS) is an extensible distributed-memory parallel graph library offering a small but powerful set of linear … ☆71 · Updated 3 weeks ago
- AMD lab notes with code examples to demonstrate use of AMD GPUs ☆95 · Updated 8 months ago
- ROCm Parallel Primitives ☆170 · Updated this week
- CUDA Matrix Multiplication Optimization ☆169 · Updated 7 months ago
- A GPU accelerated error-bounded lossy compression for scientific data. ☆72 · Updated 2 weeks ago
- FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme ☆56 · Updated 3 weeks ago
- Reusable software components for ROCm developers ☆83 · Updated this week
- Examples and exercises from the book Programming Massively Parallel Processors - A Hands-on Approach. David B. Kirk and Wen-mei W. Hwu (T… ☆65 · Updated 4 years ago
- 🎃 GPU load-balancing library for regular and irregular computations. ☆62 · Updated 8 months ago
- Benchmark for measuring the performance of sparse and irregular memory access. ☆77 · Updated last month