CoffeeBeforeArch / mmul
Serial and parallel implementations of matrix multiplication
☆41 Updated 4 years ago
Alternatives and similar repositories for mmul
Users who are interested in mmul are comparing it to the repositories listed below
- ☆100 Updated 2 years ago
- Learn OpenMP examples step by step ☆95 Updated 5 months ago
- Examples for using SYCL on CUDA ☆62 Updated 2 weeks ago
- Generate simple index ranges in C++ and CUDA C++ ☆39 Updated 2 years ago
- Intel Data Parallel C++ (and SYCL 2020) Tutorial. ☆93 Updated 3 years ago
- NVIDIA tools guide ☆138 Updated 6 months ago
- FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme ☆76 Updated 3 months ago
- ☆45 Updated 4 years ago
- Short examples illustrating AVX2 intrinsics for simple tasks. ☆96 Updated last year
- ☆16 Updated 2 years ago
- Examples from Programming in Parallel with CUDA ☆157 Updated 2 years ago
- This repository collects the materials from the course "Foundations of HPC", 2021, at the Data Science and Scientific Computing Departmen… ☆23 Updated 3 years ago
- A simple profiler to count Nvidia PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis. ☆55 Updated 3 months ago
- Source code for 'Data Parallel C++: Mastering DPC++ for Programming of Heterogeneous Systems using C++ and SYCL' by James Reinders, Ben A… ☆275 Updated 3 months ago
- C++ HPC Tutorial materials ☆54 Updated last year
- Fast Fourier transform on GPU in shared memory for AstroAccelerate project ☆26 Updated 4 years ago
- CUDA implementation of the fundamental sum reduce operation. Aims to be as optimized as reasonable. ☆37 Updated 7 years ago
- ☆67 Updated 11 years ago
- Kernel Tuning Toolkit ☆61 Updated 2 weeks ago
- Exercises and Solutions for "Programming Your GPU with OpenMP: A Hands-On Introduction" ☆144 Updated 3 months ago
- Resources for the introductory GPU programming course of the HPC-AI specialized master's program ☆21 Updated last year
- The Combinatorial BLAS (CombBLAS) is an extensible distributed-memory parallel graph library offering a small but powerful set of linear … ☆79 Updated last week
- [DEPRECATED] Moved to ROCm/rocm-libraries repo ☆119 Updated last week
- ☆29 Updated 5 years ago
- Code samples for the CUDA tutorial "CUDA and Applications to Task-based Programming" ☆91 Updated last year
- "Hardware, Software, and Compilers! Oh My!" tutorial files ☆16 Updated 5 years ago
- ☆23 Updated 3 years ago
- 🎃 GPU load-balancing library for regular and irregular computations. ☆62 Updated last year
- THIS REPOSITORY HAS MOVED TO github.com/nvidia/cub, WHICH IS AUTOMATICALLY MIRRORED HERE. ☆84 Updated last year
- CUDA Matrix Multiplication Optimization ☆201 Updated 11 months ago