nickjillings / bitonic-sort
Bitonic Sort for C and CUDA
☆15 Updated 6 years ago
Alternatives and similar repositories for bitonic-sort:
Users who are interested in bitonic-sort are comparing it to the libraries listed below.
- Lock-free parallel disjoint set data structure (aka UNION-FIND) with path compression and union by rank ☆64 Updated 9 years ago
- Whippletree, a novel approach to scheduling dynamic, irregular workloads on the GPU ☆21 Updated 9 years ago
- SuiteSparse: a suite of sparse matrix packages by @DrTimothyAldenDavis et al. with native CMake support ☆53 Updated 9 months ago
- GPU B-Tree with support for versioning (snapshots). ☆47 Updated 5 months ago
- C++ convenience classes to be used with CUDA code, for both the host and the kernel parts. ☆55 Updated 6 years ago
- Giddy - A lightweight GPU decompression library ☆42 Updated 5 years ago
- Full-speed Array of Structures access ☆169 Updated last year
- Implementation of a few sorting algorithms in OpenCL ☆35 Updated 5 years ago
- Parallel k-D Tree Construction ☆57 Updated 13 years ago
- A simple profiler to count Nvidia PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis. ☆50 Updated last month
- Corrected source for the OpenCL in Action book (work in progress) ☆64 Updated 11 years ago
- Abstractions of memory, allocator, vector, tuple, shared_ptr, unique_ptr, bitset, variant and string working on both CPU and GPU ☆30 Updated last week
- A C/C++ task-based programming model for shared memory and distributed parallel computing. ☆71 Updated 4 years ago
- ☆44 Updated 7 years ago
- This example builds on the parallel-forall repo separate compilation example by adding CMake to it. ☆17 Updated 7 years ago
- CMake Examples (CMake, CMake+CUDA, CMake+CUDA+PandaRoot) ☆41 Updated 11 years ago
- TTC: A high-performance Compiler for Tensor Transpositions ☆20 Updated 7 years ago
- a heterogeneous multiGPU level-3 BLAS library ☆45 Updated 5 years ago
- a CUDA implementation of a priority queue ☆84 Updated 4 years ago
- Polyfill some holes in the SSE intrinsics set ☆50 Updated 2 years ago
- A GPU-based LZSS compression algorithm, highly tuned for NVIDIA GPGPUs and for streaming data, leveraging the respective strengths of CPU… ☆35 Updated 9 years ago
- GPU Optimization and Memory Abstraction Framework ☆32 Updated 5 years ago
- Communication-Minimizing 2D Convolution in GPU Registers ☆30 Updated 11 years ago
- WIP for a k-d-tree implementation in CUDA ☆35 Updated 2 years ago
- Shared Memory, Message Passing, and Hybrid Merge Sort: UPC, OpenMP, MPI and Hybrid Implementations ☆14 Updated last year
- Intel Data Parallel C++ (and SYCL 2020) Tutorial. ☆93 Updated 3 years ago
- Library with JIT (Just-in-time) compilation support to optimize performance of small and medium matrix multiplication ☆14 Updated 3 years ago
- Fastest CPU (AVX/SSE) RGB to grayscale: 2-4x faster than OpenCV. For image processing/computer vision. ☆91 Updated 4 years ago
- Fast integer division with divisor not known at compile time. To be used primarily in CUDA kernels. ☆70 Updated 9 years ago
- mallocMC: Memory Allocator for Many Core Architectures ☆55 Updated 3 weeks ago