CoffeeBeforeArch / spring_2020_tutorial
"Hardware, Software, and Compilers! Oh My!" tutorial files
☆16 Updated 5 years ago
Alternatives and similar repositories for spring_2020_tutorial:
Users interested in spring_2020_tutorial are comparing it to the repositories listed below
- ☆43 Updated 4 years ago
- A GPU benchmark suite for assessing on-chip GPU memory bandwidth ☆105 Updated 7 years ago
- Examples for using SYCL on CUDA ☆62 Updated 3 weeks ago
- Generate simple index ranges in C++ and CUDA C++ ☆39 Updated last year
- Algorithms implemented in CUDA + resources about GPGPU ☆55 Updated 3 years ago
- A unified framework across multiple programming platforms ☆36 Updated 9 months ago
- Serial and parallel implementations of matrix multiplication ☆40 Updated 4 years ago
- A simple profiler to count Nvidia PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis. ☆50 Updated last week
- Slides from the "Bits of Architecture" series on YouTube ☆21 Updated 2 years ago
- Learn OpenMP examples step by step ☆91 Updated 2 months ago
- ☆22 Updated 2 years ago
- The ultimate memory bandwidth benchmark ☆47 Updated last month
- Code samples for the CUDA tutorial "CUDA and Applications to Task-based Programming" ☆89 Updated last year
- ☆29 Updated 5 years ago
- SYCL Benchmark Suite ☆64 Updated last month
- Intel Data Parallel C++ (and SYCL 2020) Tutorial. ☆93 Updated 3 years ago
- ☆66 Updated 11 years ago
- ☆34 Updated 4 years ago
- Benchmark for measuring the performance of sparse and irregular memory access. ☆77 Updated last month
- Some CUDA design patterns and a bit of template magic for CUDA ☆150 Updated last year
- MiniAMR Adaptive Mesh Refinement (AMR) Mini-App ☆34 Updated 4 months ago
- BGHT: High-performance static GPU hash tables. ☆62 Updated 6 months ago
- Implementation of TSM2L and TSM2R -- High-Performance Tall-and-Skinny Matrix-Matrix Multiplication Algorithms for CUDA ☆32 Updated 4 years ago
- MagmaDNN: a simple deep learning framework in c++ ☆50 Updated 4 years ago
- NVIDIA HPCG is based on the HPCG benchmark and optimized for performance on NVIDIA accelerated HPC systems. ☆51 Updated last month
- ☆34 Updated last year
- Distributed Communication-Optimal LU-factorization Algorithm ☆12 Updated 3 years ago
- RAJA Performance Suite ☆118 Updated this week
- The Task-Aware MPI (TAMPI) library extends the functionality of standard MPI libraries by providing new mechanisms for improving the inte… ☆23 Updated 4 months ago
- tools to create performance and roofline plots from measured data ☆58 Updated 10 years ago