mark-poscablo / gpu-sum-reduction
CUDA implementation of the fundamental sum reduce operation. Aims to be as optimized as reasonable.
☆39 Updated 8 years ago
Alternatives and similar repositories for gpu-sum-reduction
Users who are interested in gpu-sum-reduction are comparing it to the libraries listed below.
- Implementation and analysis of five different GPU based SPMV algorithms in CUDA ☆40 Updated 6 years ago
- THIS REPOSITORY HAS MOVED TO github.com/nvidia/cub, WHICH IS AUTOMATICALLY MIRRORED HERE. ☆85 Updated last year
- Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial ☆335 Updated this week
- A GPU benchmark suite for assessing on-chip GPU memory bandwidth ☆109 Updated 8 years ago
- Instructions, Docker images, and examples for Nsight Compute and Nsight Systems ☆134 Updated 5 years ago
- Training material for Nsight developer tools ☆173 Updated last year
- 🎃 GPU load-balancing library for regular and irregular computations. ☆63 Updated 2 months ago
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core) ☆146 Updated 5 years ago
- Efficient SpGEMM on GPU using CUDA and CSR ☆58 Updated 2 years ago
- An extension library of WMMA API (Tensor Core API) ☆109 Updated last year
- ☆17 Updated 3 years ago
- ☆268 Updated 3 weeks ago
- Subset of BLAS routines optimized for NVIDIA GPUs ☆74 Updated 2 years ago
- Assembler for NVIDIA Volta and Turing GPUs ☆234 Updated 3 years ago
- CSR-based SpGEMM on nVidia and AMD GPUs ☆46 Updated 9 years ago
- ☆48 Updated 5 years ago
- CUDA Matrix Multiplication Optimization ☆241 Updated last year
- CUDA Tensor Transpose (cuTT) library ☆53 Updated 8 years ago
- Online CUDA Occupancy Calculator ☆80 Updated 4 years ago
- High-performance, GPU-aware communication library ☆86 Updated 10 months ago
- Distributed Communication-Optimal Matrix-Matrix Multiplication Algorithm ☆211 Updated this week
- ☆62 Updated 11 months ago
- Efficient Top-K implementation on the GPU ☆187 Updated 6 years ago
- ☆597 Updated last week
- Intel Data Parallel C++ (and SYCL 2020) Tutorial. ☆95 Updated 3 years ago
- Parallel selection on GPUs ☆15 Updated 4 years ago
- Generate simple index ranges in C++ and CUDA C++ ☆39 Updated 2 years ago
- BGHT: High-performance static GPU hash tables. ☆71 Updated 5 months ago
- ☆71 Updated 11 years ago
- ☆94 Updated 8 years ago