knotman90 / cuStreamComp
Efficient CUDA Stream Compaction Library
☆33 · Updated last year
Related projects
Alternatives and complementary repositories for cuStreamComp
- Full-speed Array of Structures access ☆161 · Updated last year
- Fast integer division with divisor not known at compile time. To be used primarily in CUDA kernels. ☆70 · Updated 9 years ago
- A GPU benchmark suite for assessing on-chip GPU memory bandwidth ☆99 · Updated 7 years ago
- CUDA implementation of exclusive prefix sum via Blelloch's algorithm ☆25 · Updated 7 years ago
- A warp-oriented dynamic hash table for GPUs ☆71 · Updated 10 months ago
- CUDA implementation of the fundamental sum reduce operation. Aims to be as optimized as reasonable. ☆35 · Updated 7 years ago
- CUDA and OpenMP implementations of C2R/R2C inplace transposition ☆45 · Updated 9 years ago
- CUDA kernel author's tools ☆109 · Updated 2 years ago
- A portable high-level API with CUDA or OpenCL back-end ☆54 · Updated 7 years ago
- Example of how to use CUDA with CMake >= 3.8 ☆69 · Updated last year
- A fast and highly scalable GPU dynamic memory allocator ☆103 · Updated 9 years ago
- Some CUDA design patterns and a bit of template magic for CUDA ☆146 · Updated last year
- BGHT: High-performance static GPU hash tables. ☆55 · Updated 2 months ago
- Launching collective tasks in bulk ☆36 · Updated 5 years ago
- An implementation of parallel exclusive scan in CUDA ☆59 · Updated 6 years ago
- Range-based for loops to iterate over a range of numbers or values ☆35 · Updated 7 years ago
- CUDA Data Parallel Primitives Library ☆421 · Updated 6 years ago
- A CUDA implementation of a priority queue ☆81 · Updated 4 years ago
- Generate simple index ranges in C++ and CUDA C++ ☆39 · Updated last year
- ☆67 · Updated 2 years ago
- Communication-Minimizing 2D Convolution in GPU Registers ☆30 · Updated 11 years ago
- TTC: A high-performance Compiler for Tensor Transpositions ☆20 · Updated 7 years ago
- portDNN is a library implementing neural network algorithms written using SYCL ☆108 · Updated 6 months ago
- CUDA Tensor Transpose (cuTT) library ☆50 · Updated 7 years ago
- Use CUDA intrinsics with user-defined types ☆47 · Updated 10 years ago
- A Library for fast Hash Tables on GPUs ☆109 · Updated 2 years ago
- Corrected source for the OpenCL in Action book (work in progress) ☆61 · Updated 11 years ago
- Greentea LibDNN - a universal convolution implementation supporting CUDA and OpenCL ☆135 · Updated 7 years ago
- Efficient Top-K implementation on the GPU ☆149 · Updated 5 years ago
- A library to benchmark CUDA code, similar to google benchmark. ☆28 · Updated 3 years ago