CoffeeBeforeArch / mmul
Serial and parallel implementations of matrix multiplication
☆42Updated 4 years ago
Alternatives and similar repositories for mmul
Users who are interested in mmul are comparing it to the repositories listed below
- NVIDIA tools guide☆144Updated 7 months ago
- CUDA Matrix Multiplication Optimization☆213Updated last year
- Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial☆287Updated last month
- Instructions, Docker images, and examples for Nsight Compute and Nsight Systems☆131Updated 5 years ago
- Examples from Programming in Parallel with CUDA☆158Updated 2 years ago
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core)☆138Updated 4 years ago
- ☆101Updated 2 years ago
- Examples and exercises from the book Programming Massively Parallel Processors - A Hands-on Approach. David B. Kirk and Wen-mei W. Hwu (T…☆72Updated 4 years ago
- 📚 A curated list of awesome matrix-matrix multiplication (A * B = C) frameworks, libraries and software☆51Updated 5 months ago
- CUDA implementation of the fundamental sum reduce operation. Aims to be as optimized as reasonable.☆37Updated 8 years ago
- Samples demonstrating how to use the Compute Sanitizer Tools and Public API☆85Updated last year
- 🎃 GPU load-balancing library for regular and irregular computations.☆62Updated last year
- CUTLASS and CuTe Examples☆65Updated 3 weeks ago
- Benchmark for measuring the performance of sparse and irregular memory access.☆78Updated 3 months ago
- Simple neural network implementation using CUDA technology. It is an educational implementation.☆97Updated 7 years ago
- ☆45Updated 4 years ago
- Training material for Nsight developer tools☆163Updated last year
- ☆106Updated last year
- A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")☆346Updated this week
- ☆16Updated 3 years ago
- Stepwise optimizations of DGEMM on CPU, reaching performance faster than Intel MKL eventually, even under multithreading.☆151Updated 3 years ago
- GPUOcelot: A dynamic compilation framework for PTX☆207Updated 6 months ago
- Source code for 'Data Parallel C++: Mastering DPC++ for Programming of Heterogeneous Systems using C++ and SYCL' by James Reinders, Ben A…☆275Updated 4 months ago
- An extension library of WMMA API (Tensor Core API)☆99Updated last year
- Advanced Matrix Extensions (AMX) Guide☆95Updated 3 years ago
- Dissecting NVIDIA GPU Architecture☆103Updated 3 years ago
- Online CUDA Occupancy Calculator☆79Updated 3 years ago
- collection of benchmarks to measure basic GPU capabilities☆401Updated 5 months ago
- A plugin for Jupyter Notebook to run CUDA C/C++ code☆238Updated 10 months ago
- Sample examples of how to call collective operation functions on multi-GPU environments. A simple example of using broadcast, reduce, all…☆34Updated last year