CoffeeBeforeArch / mmul
Serial and parallel implementations of matrix multiplication
☆40 Updated 4 years ago
Alternatives and similar repositories for mmul:
Users who are interested in mmul are comparing it to the repositories listed below
- ☆91 Updated 2 years ago
- NVIDIA tools guide ☆127 Updated 3 months ago
- "Hardware, Software, and Compilers! Oh My!" tutorial files ☆16 Updated 5 years ago
- ☆43 Updated 4 years ago
- CUDA Matrix Multiplication Optimization ☆179 Updated 9 months ago
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core) ☆130 Updated 4 years ago
- A simple profiler to count NVIDIA PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis. ☆50 Updated last month
- 🎃 GPU load-balancing library for regular and irregular computations. ☆62 Updated 10 months ago
- An extension library of WMMA API (Tensor Core API) ☆96 Updated 9 months ago
- ☆40 Updated 3 years ago
- Learn OpenMP examples step by step ☆91 Updated 3 months ago
- C++ HPC Tutorial materials ☆49 Updated 9 months ago
- ☆23 Updated 3 years ago
- NVIDIA HPCG is based on the HPCG benchmark and optimized for performance on NVIDIA accelerated HPC systems. ☆51 Updated this week
- FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme ☆59 Updated last month
- ☆67 Updated 11 years ago
- ROCm Thrust - run Thrust dependent software on AMD GPUs ☆107 Updated last week
- Code samples for the CUDA tutorial "CUDA and Applications to Task-based Programming" ☆88 Updated last year
- Generate simple index ranges in C++ and CUDA C++ ☆39 Updated last year
- Examples and exercises from the book Programming Massively Parallel Processors - A Hands-on Approach. David B. Kirk and Wen-mei W. Hwu (T… ☆67 Updated 4 years ago
- Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs. ☆32 Updated 3 weeks ago
- The Combinatorial BLAS (CombBLAS) is an extensible distributed-memory parallel graph library offering a small but powerful set of linear … ☆72 Updated 3 weeks ago
- Next generation LAPACK implementation for ROCm platform ☆99 Updated last week
- Advanced Profiling and Analytics for AMD Hardware ☆148 Updated this week
- Benchmark for measuring the performance of sparse and irregular memory access. ☆75 Updated last week
- ☆29 Updated 5 years ago
- Instructions, Docker images, and examples for Nsight Compute and Nsight Systems ☆130 Updated 4 years ago
- MagmaDNN: a simple deep learning framework in C++ ☆49 Updated 4 years ago
- Examples from Programming in Parallel with CUDA ☆134 Updated 2 years ago
- Fast Matrix Multiplication Implementation in the C programming language. This matrix multiplication algorithm is similar to what NumPy uses t… ☆32 Updated 3 years ago