rbga / CUDA-Merge-and-Bitonic-Sort
Efficient implementations of Merge Sort and Bitonic Sort algorithms using CUDA for GPU parallel processing, resulting in accelerated sorting of large arrays. Includes both CPU and GPU versions, along with a performance comparison.
☆15 Updated last year
Alternatives and similar repositories for CUDA-Merge-and-Bitonic-Sort
Users who are interested in CUDA-Merge-and-Bitonic-Sort are comparing it to the repositories listed below.
- My notes on various HPC papers. ☆22 Updated 2 years ago
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core) ☆131 Updated 4 years ago
- An extension library of WMMA API (Tensor Core API) ☆96 Updated 10 months ago
- ☆15 Updated 5 years ago
- ☆17 Updated last year
- CUDA Matrix Multiplication Optimization ☆186 Updated 9 months ago
- Personal Notes for Learning HPC & Parallel Computation [Active Adding New Content] ☆66 Updated 2 years ago
- Stepwise optimizations of DGEMM on CPU, reaching performance faster than Intel MKL eventually, even under multithreading. ☆144 Updated 3 years ago
- A simple profiler to count Nvidia PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis. ☆52 Updated last month
- ☆23 Updated 3 years ago
- AMD lab notes with code examples to demonstrate use of AMD GPUs ☆98 Updated 10 months ago
- Directed Acyclic Graph Execution Engine (DAGEE) is a C++ library that enables programmers to express computation and data movement, as ta… ☆47 Updated 3 years ago
- Examples from Programming in Parallel with CUDA ☆143 Updated 2 years ago
- rocWMMA ☆111 Updated this week
- ☆67 Updated 11 years ago
- 🎃 GPU load-balancing library for regular and irregular computations. ☆62 Updated 11 months ago
- NUMA-aware multi-CPU multi-GPU data transfer benchmarks