CoffeeBeforeArch / mmul
Serial and parallel implementations of matrix multiplication
☆34Updated 3 years ago
Related projects:
- NVIDIA HPCG is based on the HPCG benchmark and optimized for performance on NVIDIA accelerated HPC systems.☆41Updated 3 weeks ago
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core)☆109Updated 4 years ago
- Short examples illustrating AVX2 intrinsics for simple tasks.☆81Updated 6 months ago
- CUDA Matrix Multiplication Optimization☆118Updated 2 months ago
- ROCm Thrust - run Thrust dependent software on AMD GPUs☆100Updated this week
- Intel Data Parallel C++ (and SYCL 2020) Tutorial.☆90Updated 2 years ago
- A simple profiler to count Nvidia PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis.☆39Updated 8 months ago
- 🎃 GPU load-balancing library for regular and irregular computations.☆56Updated 3 months ago
- Resources for the introductory GPU programming course of the HPC-AI specialized master's program☆22Updated 8 months ago
- Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs.☆21Updated last week
- Generate simple index ranges in C++ and CUDA C++☆38Updated last year
- MagmaDNN: a simple deep learning framework in c++☆45Updated 4 years ago
- Advanced Profiling and Analytics for AMD Hardware☆132Updated last week
- Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial☆165Updated 3 months ago
- CUDA implementation of the fundamental sum reduce operation. Aims to be as optimized as reasonable.☆35Updated 7 years ago
- Intermediate MPI lesson☆25Updated last year
- Benchmark for measuring the performance of sparse and irregular memory access.☆72Updated last week
- Examples from Programming in Parallel with CUDA☆101Updated last year
- Template for starting CUDA/C++ project using CMake with Github Action for CI☆29Updated last year
- Fast Matrix Multiplication Implementation in C programming language. This matrix multiplication algorithm is similar to what Numpy uses t…☆21Updated 3 years ago
- Examples for using SYCL on CUDA☆59Updated 2 weeks ago
- AMD’s C++ library for accelerating tensor primitives☆35Updated this week
- Code samples for the CUDA tutorial "CUDA and Applications to Task-based Programming"☆78Updated last year
- CUDA kernel author's tools☆105Updated 2 years ago
- Fast Fourier transform on GPU in shared memory for the AstroAccelerate project☆24Updated 3 years ago
- Next generation LAPACK implementation for ROCm platform☆91Updated this week