umfranzw / cuda-reduction-example
This example starts with a simple sum reduction in CUDA, then steps through a series of optimizations we can perform to improve its performance on the GPU. These examples were created alongside a series of lectures (on GPGPU computing) for an undergraduate parallel computing course. You can find the lecture slides in the slides/ directory.
☆13 · Updated 4 years ago
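As a rough illustration of the kind of starting point the description refers to, here is a minimal sketch of a block-level sum reduction in CUDA. It is not taken from the repository: the names `blockSum`, `gpuSum`, and `kBlockSize` are hypothetical, and the block size of 256 is an assumed value. Each block reduces its slice in shared memory, and the host adds up the per-block partial sums.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical block size (an assumption); the loop below needs a power of two.
constexpr int kBlockSize = 256;

// Each block loads up to kBlockSize elements into shared memory and
// reduces them to a single partial sum with a tree-shaped loop.
__global__ void blockSum(const int *in, int *partial, int n) {
    __shared__ int cache[kBlockSize];
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Out-of-range threads contribute 0 so the last block stays correct.
    cache[tid] = (gid < n) ? in[gid] : 0;
    __syncthreads();

    // Halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) cache[tid] += cache[tid + stride];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = cache[0];
}

// Host wrapper: copies data to the device, launches the kernel, and
// sums the per-block results on the CPU. Returns -1 on a CUDA error.
long long gpuSum(const std::vector<int> &host) {
    const int n = static_cast<int>(host.size());
    if (n == 0) return 0;
    const int blocks = (n + kBlockSize - 1) / kBlockSize;

    int *dIn = nullptr, *dPartial = nullptr;
    if (cudaMalloc(&dIn, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&dPartial, blocks * sizeof(int)) != cudaSuccess) {
        cudaFree(dIn);
        return -1;
    }
    cudaMemcpy(dIn, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    blockSum<<<blocks, kBlockSize>>>(dIn, dPartial, n);

    // This blocking copy also waits for the kernel to finish.
    std::vector<int> partial(blocks);
    cudaError_t err = cudaMemcpy(partial.data(), dPartial,
                                 blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dPartial);
    if (err != cudaSuccess) return -1;

    long long total = 0;
    for (int p : partial) total += p;
    return total;
}

int main() {
    std::vector<int> data(1 << 20, 1);  // one million-ish ones
    long long total = gpuSum(data);
    printf("sum = %lld\n", total);
    return total == static_cast<long long>(data.size()) ? 0 : 1;
}
```

A sketch like this leaves obvious room for the sort of tuning the description mentions, for example having each thread add several elements before the shared-memory phase; which optimizations the repository actually covers is not stated here.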
Alternatives and similar repositories for cuda-reduction-example:
Users who are interested in cuda-reduction-example are comparing it to the repositories listed below.
- High-Performance Sparse Linear Algebra on HBM-Equipped FPGAs Using HLS ☆90 · Updated 7 months ago
- Universal number Posit HDL Arithmetic Architecture generator ☆57 · Updated 5 years ago
- Systolic array implementations for Cholesky, LU, and QR decomposition ☆42 · Updated 5 months ago
- MaxEVA: Maximizing the Efficiency of Matrix Multiplication on Versal AI Engine (accepted as full paper at FPT'23) ☆20 · Updated last year
- Provides the code for the paper "EBPC: Extended Bit-Plane Compression for Deep Neural Network Inference and Training Accelerators" by Luk… ☆19 · Updated 5 years ago
- Matrix Operation Library for FPGA https://xilinx.github.io/gemx/ ☆63 · Updated 5 years ago
- Vulkan-Sim is a GPU architecture simulator for Vulkan ray tracing based on GPGPU-Sim and Mesa. ☆59 · Updated 2 months ago
- BLAS implementation for Intel FPGA ☆78 · Updated 4 years ago
- AIM: Accelerating Arbitrary-precision Integer Multiplication on Heterogeneous Reconfigurable Computing Platform Versal ACAP (Full Paper a… ☆22 · Updated 3 weeks ago
- FRAME: Fast Roofline Analytical Modeling and Estimation ☆34 · Updated last year
- High-level synthesis (HLS) implementation of Sparse Matrix Vector Multiplication ☆15 · Updated 3 years ago
- An FPGA accelerator for general-purpose Sparse-Matrix Dense-Matrix Multiplication (SpMM). ☆78 · Updated 9 months ago
- A general framework for optimizing DNN dataflow on systolic array ☆35 · Updated 4 years ago
- HLSFactory: A Framework Empowering High-Level Synthesis Datasets for Machine Learning and Beyond ☆34 · Updated last week
- An open-source parameterizable NPU generator with full-stack multi-target compilation stack for intelligent workloads. ☆50 · Updated last month
- An Open Workflow to Build Custom SoCs and run Deep Models at the Edge ☆76 · Updated 2 months ago
- A DAG processor and compiler for a tree-based spatial datapath. ☆13 · Updated 2 years ago
- Tutorials on HLS Design ☆51 · Updated 5 years ago
- Hands-on experience programming AI Engines using Vitis Unified Software Platform ☆40 · Updated 9 months ago
- Designs for finalist teams of the DAC System Design Contest ☆37 · Updated 4 years ago
- FPGA acceleration of arbitrary precision floating point computations. ☆38 · Updated 2 years ago
- CUDA implementation of the fundamental sum reduce operation. Aims to be as optimized as reasonable. ☆37 · Updated 7 years ago