CoffeeBeforeArch / mmul
Serial and parallel implementations of matrix multiplication
☆41Updated 4 years ago
Alternatives and similar repositories for mmul
Users who are interested in mmul are comparing it to the repositories listed below
- CUDA Matrix Multiplication Optimization☆188Updated 10 months ago
- Learn OpenMP examples step by step☆95Updated 4 months ago
- ☆44Updated 4 years ago
- NVIDIA tools guide☆133Updated 4 months ago
- AMD’s C++ library for accelerating tensor primitives☆41Updated this week
- Benchmark for measuring the performance of sparse and irregular memory access.☆78Updated last month
- CUDA implementation of the fundamental sum reduce operation. Aims to be as optimized as reasonable.☆37Updated 7 years ago
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core)☆134Updated 4 years ago
- Resources for the introductory GPU programming course of the Mastère Spécialisé HPC-AI☆22Updated last year
- Generate simple index ranges in C++ and CUDA C++☆39Updated last year
- 🎃 GPU load-balancing library for regular and irregular computations.☆62Updated 11 months ago
- Samples demonstrating how to use the Compute Sanitizer Tools and Public API☆82Updated last year
- ☆98Updated 2 years ago
- Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs.☆32Updated 2 months ago
- ☆29Updated 5 years ago
- FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme☆70Updated 2 months ago
- ROCm Thrust - run Thrust dependent software on AMD GPUs☆120Updated this week
- ☆23Updated 3 years ago
- Advanced Profiling and Analytics for AMD Hardware☆156Updated this week
- "Hardware, Software, and Compilers! Oh My!" tutorial files☆16Updated 5 years ago
- A simple profiler to count NVIDIA PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis.☆52Updated 2 months ago
- BLAS implementation for Intel FPGA☆78Updated 4 years ago
- Fast Matrix Multiplication Implementation in C programming language. This matrix multiplication algorithm is similar to what Numpy uses t…☆34Updated 4 years ago
- An extension library of WMMA API (Tensor Core API)☆97Updated 10 months ago
- NVIDIA HPCG is based on the HPCG benchmark and optimized for performance on NVIDIA accelerated HPC systems.☆56Updated last month
- Intel Data Parallel C++ (and SYCL 2020) Tutorial.☆93Updated 3 years ago
- rocSHMEM intra-kernel networking runtime for AMD dGPUs on the ROCm platform.☆86Updated last week
- Examples and exercises from the book Programming Massively Parallel Processors - A Hands-on Approach. David B. Kirk and Wen-mei W. Hwu (T…☆67Updated 4 years ago
- Online CUDA Occupancy Calculator☆76Updated 3 years ago
- ☆17Updated 3 weeks ago