mark-poscablo / gpu-sum-reduction

CUDA implementation of the fundamental sum reduce operation. Aims to be as optimized as reasonable.

☆37

Alternatives and similar repositories for gpu-sum-reduction

Users interested in gpu-sum-reduction are comparing it to the libraries listed below.

NVlabs / cub
THIS REPOSITORY HAS MOVED TO github.com/nvidia/cub, WHICH IS AUTOMATICALLY MIRRORED HERE.
☆84 · Updated last year
cwpearson / nvidia-performance-tools
Instructions, Docker images, and examples for Nsight Compute and Nsight Systems
☆131 · Updated 5 years ago
poojahira / spmv-cuda
Implementation and analysis of five different GPU based SPMV algorithms in CUDA
☆41 · Updated 6 years ago
FZJ-JSC / tutorial-multi-gpu
Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial
☆287 · Updated last month
ekondis / gpumembench
A GPU benchmark suite for assessing on-chip GPU memory bandwidth
☆106 · Updated 7 years ago
wmmae / wmma_extension
An extension library of WMMA API (Tensor Core API)
☆99 · Updated last year
xmartlabs / cuda-calculator
Online CUDA Occupancy Calculator
☆79 · Updated 3 years ago
dumerrill / merge-spmv
☆94 · Updated 8 years ago
weifengliu-ssslab / Benchmark_SpGEMM_using_CSR
CSR-based SpGEMM on nVidia and AMD GPUs
☆46 · Updated 9 years ago
gunrock / loops
🎃 GPU load-balancing library for regular and irregular computations.
☆62 · Updated last year
NVIDIA / nsight-training
Training material for Nsight developer tools
☆163 · Updated 11 months ago
wzsh / wmma_tensorcore_sample
Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core)
☆138 · Updated 4 years ago
cyanguwa / nersc-roofline
☆45 · Updated 4 years ago
GPUPeople / spECK
Efficient SpGEMM on GPU using CUDA and CSR
☆57 · Updated 2 years ago
ndd314 / cuda_examples
☆68 · Updated 11 years ago
ROCm / amd_matrix_instruction_calculator
A tool for generating information about the matrix multiplication instructions in AMD Radeon™ and AMD Instinct™ accelerators
☆110 · Updated 2 months ago
ecrc / kblas-gpu
Subset of BLAS routines optimized for NVIDIA GPUs
☆71 · Updated 2 years ago
daadaada / turingas
Assembler for NVIDIA Volta and Turing GPUs
☆226 · Updated 3 years ago
c3sr / comm_scope
NUMA-aware multi-CPU multi-GPU data transfer benchmarks
☆24 · Updated last year
jeffhammond / dpcpp-tutorial
Intel Data Parallel C++ (and SYCL 2020) Tutorial.
☆94 · Updated 3 years ago
zjin-lcf / HeCBench
☆249 · Updated last month
mmperf / mmperf
MatMul Performance Benchmarks for a Single CPU Core comparing both hand engineered and codegen kernels.
☆134 · Updated last year
RRZE-HPC / gpu-benches
collection of benchmarks to measure basic GPU capabilities
☆401 · Updated 5 months ago
ROCm / rocprofiler-compute
Advanced Profiling and Analytics for AMD Hardware
☆161 · Updated this week
NVIDIA / compute-sanitizer-samples
Samples demonstrating how to use the Compute Sanitizer Tools and Public API
☆85 · Updated last year
intel / xetla
☆62 · Updated 7 months ago
owensgroup / BGHT
BGHT: High-performance static GPU hash tables.
☆70 · Updated last month
sunlex0717 / DissectingTensorCores
☆106 · Updated last year
hummingtree / cuda-graph-with-dynamic-parameters
☆16 · Updated 2 years ago
ROCm / rocSPARSE
[DEPRECATED] Moved to ROCm/rocm-libraries repo
☆129 · Updated this week