umfranzw / cuda-reduction-example
This example starts with a simple sum reduction in CUDA, then steps through a series of optimizations that improve its performance on the GPU. The examples were created alongside a series of lectures on GPGPU computing for an undergraduate parallel computing course. The lecture slides are in the slides/ directory.
☆13 Updated 4 years ago
Alternatives and similar repositories for cuda-reduction-example
Users who are interested in cuda-reduction-example are comparing it to the repositories listed below.
- ☆68 Updated 7 months ago
- Approximate layers - TensorFlow extension ☆27 Updated last month
- MaxEVA: Maximizing the Efficiency of Matrix Multiplication on Versal AI Engine (accepted as full paper at FPT'23) ☆21 Updated last year
- ☆35 Updated 2 months ago
- High-Performance Sparse Linear Algebra on HBM-Equipped FPGAs Using HLS ☆91 Updated 8 months ago
- Examples shown as part of the tutorial "Productive parallel programming on FPGA with high-level synthesis". ☆199 Updated 3 years ago
- This course provides professors with an understanding of high-level synthesis design methodologies necessary to develop digital systems u… ☆53 Updated 6 years ago
- A general framework for optimizing DNN dataflow on systolic array ☆36 Updated 4 years ago
- GPGPU-Sim provides a detailed simulation model of a contemporary GPU running CUDA and/or OpenCL workloads and now includes an integrated… ☆54 Updated this week
- ☆35 Updated 4 years ago
- ☆44 Updated 5 years ago
- Vulkan-Sim is a GPU architecture simulator for Vulkan ray tracing based on GPGPU-Sim and Mesa. ☆60 Updated 4 months ago
- An Open Workflow to Build Custom SoCs and run Deep Models at the Edge ☆79 Updated 2 weeks ago
- TensorCore Vector Processor for Deep Learning - Google Summer of Code Project ☆22 Updated 3 years ago
- Example code for Modern SystemC using Modern C++ ☆63 Updated 2 years ago
- Provides the code for the paper "EBPC: Extended Bit-Plane Compression for Deep Neural Network Inference and Training Accelerators" by Luk… ☆19 Updated 5 years ago
- Hands-on experience programming AI Engines using Vitis Unified Software Platform ☆40 Updated 10 months ago
- Algorithmic C Math Library ☆62 Updated 2 weeks ago
- CHARM: Composing Heterogeneous Accelerators on Heterogeneous SoC Architecture ☆143 Updated this week
- BLAS implementation for Intel FPGA ☆78 Updated 4 years ago
- Dissecting NVIDIA GPU Architecture ☆95 Updated 2 years ago
- Posit Arithmetic Cores generated with FloPoCo ☆24 Updated 11 months ago
- Parallel sparse direct solver for circuit simulation ☆43 Updated 2 years ago
- High-level synthesis (HLS) implementation of Sparse Matrix Vector Multiplication ☆15 Updated 3 years ago
- ☆30 Updated 2 months ago
- Linux docker for the DNN accelerator exploration infrastructure composed of Accelergy and Timeloop ☆52 Updated last month
- Pursuing the best performance of linear solver in circuit simulation ☆37 Updated 3 months ago
- ☆96 Updated last year
- [TCAD'23] AccelTran: A Sparsity-Aware Accelerator for Transformers ☆44 Updated last year
- An FPGA accelerator for general-purpose Sparse-Matrix Dense-Matrix Multiplication (SpMM). ☆79 Updated 10 months ago