CoffeeBeforeArch / mmul
Serial and parallel implementations of matrix multiplication
☆40 Updated 4 years ago
Alternatives and similar repositories for mmul:
Users who are interested in mmul are comparing it to the libraries listed below.
- ☆37 Updated 3 years ago
- NVIDIA tools guide ☆118 Updated 2 months ago
- Examples from Programming in Parallel with CUDA ☆130 Updated 2 years ago
- Generate simple index ranges in C++ and CUDA C++ ☆39 Updated last year
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core) ☆127 Updated 4 years ago
- 🎃 GPU load-balancing library for regular and irregular computations. ☆62 Updated 9 months ago
- Learn OpenMP examples step by step ☆91 Updated 2 months ago
- NVIDIA HPCG is based on the HPCG benchmark and optimized for performance on NVIDIA accelerated HPC systems. ☆51 Updated last month
- "Hardware, Software, and Compilers! Oh My!" tutorial files ☆16 Updated 5 years ago
- ☆43 Updated 4 years ago
- CUDA Matrix Multiplication Optimization ☆173 Updated 8 months ago
- CUDA implementation of the fundamental sum reduce operation. Aims to be as optimized as reasonable. ☆36 Updated 7 years ago
- ☆29 Updated 5 years ago
- ☆16 Updated 2 years ago
- Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial ☆247 Updated last week
- Samples demonstrating how to use the Compute Sanitizer Tools and Public API ☆76 Updated last year
- ROCm Thrust - run Thrust dependent software on AMD GPUs ☆106 Updated this week
- FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme ☆56 Updated this week
- THIS REPOSITORY HAS MOVED TO github.com/nvidia/cub, WHICH IS AUTOMATICALLY MIRRORED HERE. ☆84 Updated last year
- End to End steps for adding custom ops in PyTorch. ☆21 Updated 4 years ago
- An extension library of WMMA API (Tensor Core API) ☆91 Updated 8 months ago
- Instructions, Docker images, and examples for Nsight Compute and Nsight Systems ☆130 Updated 4 years ago
- Short examples illustrating AVX2 intrinsics for simple tasks. ☆88 Updated last year
- Benchmark for measuring the performance of sparse and irregular memory access. ☆77 Updated last month
- ☆91 Updated 2 years ago
- ☆92 Updated 11 months ago
- Distributed Communication-Optimal Matrix-Matrix Multiplication Algorithm ☆204 Updated 3 months ago
- Code samples for the CUDA tutorial "CUDA and Applications to Task-based Programming" ☆89 Updated last year
- AMD’s C++ library for accelerating tensor primitives ☆39 Updated this week
- Intel Data Parallel C++ (and SYCL 2020) Tutorial. ☆93 Updated 3 years ago