knotman90 / cuStreamComp
Efficient CUDA Stream Compaction Library
☆33 Updated last year
Alternatives and similar repositories for cuStreamComp:
Users who are interested in cuStreamComp compare it to the libraries listed below.
- Full-speed Array of Structures access ☆164 Updated last year
- BGHT: High-performance static GPU hash tables. ☆61 Updated 5 months ago
- Fast integer division with divisor not known at compile time. To be used primarily in CUDA kernels. ☆71 Updated 9 years ago
- A GPU benchmark suite for assessing on-chip GPU memory bandwidth ☆104 Updated 7 years ago
- A portable high-level API with CUDA or OpenCL back-end ☆54 Updated 7 years ago
- Some CUDA design patterns and a bit of template magic for CUDA ☆148 Updated last year
- ☆67 Updated 2 years ago
- A fast and highly scalable GPU dynamic memory allocator ☆104 Updated 9 years ago
- Communication-Minimizing 2D Convolution in GPU Registers ☆30 Updated 11 years ago
- CUDA implementation of parallel radix sort using Blelloch scan ☆62 Updated last year
- CUDA kernel author's tools ☆110 Updated 2 years ago
- portDNN is a library implementing neural network algorithms written using SYCL ☆111 Updated 9 months ago
- Lock-free parallel disjoint set data structure (aka UNION-FIND) with path compression and union by rank ☆64 Updated 9 years ago
- A simple memory manager for CUDA designed to help Deep Learning frameworks manage memory ☆296 Updated 6 years ago
- An implementation of parallel exclusive scan in CUDA ☆62 Updated 7 years ago
- Efficient Top-K implementation on the GPU ☆155 Updated 5 years ago
- CUDA Data Parallel Primitives Library ☆427 Updated 6 years ago
- A warp-oriented dynamic hash table for GPUs ☆74 Updated last year
- A Library for fast Hash Tables on GPUs ☆114 Updated 2 years ago
- A library to benchmark CUDA code, similar to google benchmark. ☆28 Updated 3 years ago
- Third party assembler and GEMM library for NVIDIA Kepler GPU ☆80 Updated 5 years ago
- CUDA implementation of exclusive prefix sum via Blelloch's algorithm ☆27 Updated 7 years ago
- a CUDA implementation of a priority queue ☆83 Updated 4 years ago
- Launching collective tasks in bulk ☆37 Updated 5 years ago
- CNNs in Halide ☆23 Updated 9 years ago
- Use CUDA intrinsics with user-defined types ☆47 Updated 10 years ago
- a heterogeneous multiGPU level-3 BLAS library ☆45 Updated 5 years ago
- Corrected source for the OpenCL in Action book (work in progress) ☆62 Updated 11 years ago
- CLTune: An automatic OpenCL & CUDA kernel tuner ☆174 Updated 2 years ago
- A single-header C++ library for simplifying the use of CUDA Runtime Compilation (NVRTC). ☆522 Updated 2 weeks ago