rbga / CUDA-Merge-and-Bitonic-Sort
Efficient implementations of Merge Sort and Bitonic Sort algorithms using CUDA for GPU parallel processing, resulting in accelerated sorting of large arrays. Includes both CPU and GPU versions, along with a performance comparison.
☆16 Updated last year
Alternatives and similar repositories for CUDA-Merge-and-Bitonic-Sort
Users who are interested in CUDA-Merge-and-Bitonic-Sort are comparing it to the repositories listed below.
- ☆17 Updated last year
- My notes on various HPC papers. ☆22 Updated 2 years ago
- MLIR-based toolkit targeting intel heterogeneous hardware ☆45 Updated 4 months ago
- ☆26 Updated 4 months ago
- A simple profiler to count Nvidia PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis. ☆55 Updated 3 months ago
- Code for paper "Engineering a High-Performance GPU B-Tree" accepted to PPoPP 2019 ☆57 Updated 3 years ago
- CUDA Matrix Multiplication Optimization ☆201 Updated 11 months ago
- GPU B-Tree with support for versioning (snapshots). ☆49 Updated 8 months ago
- AMD lab notes with code examples to demonstrate use of AMD GPUs ☆98 Updated last year
- IMPACT GPU Algorithms Teaching Labs ☆58 Updated 2 years ago
- An extension library of WMMA API (Tensor Core API) ☆99 Updated last year
- Stepwise optimizations of DGEMM on CPU, eventually outperforming Intel MKL, even under multithreading. ☆151 Updated 3 years ago
- Personal Notes for Learning HPC & Parallel Computation [Active Adding New Content] ☆68 Updated 2 years ago
- CUTLASS and CuTe Examples ☆60 Updated 6 months ago
- NUMA-aware multi-CPU multi-GPU data transfer benchmarks ☆23 Updated last year
- Learn OpenMP examples step by step ☆95 Updated 5 months ago
- Examples from Programming in Parallel with CUDA ☆157 Updated 2 years ago
- Super fast FP32 matrix multiplication on RDNA3 ☆68 Updated 3 months ago
- Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial ☆277 Updated last month
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. ☆90 Updated 2 weeks ago
- 🎃 GPU load-balancing library for regular and irregular computations. ☆62 Updated last year
- LLVM/MLIR based compiler instrumentation of AMD GPU kernels ☆18 Updated 2 months ago
- 📚 A curated list of awesome matrix-matrix multiplication (A * B = C) frameworks, libraries and software ☆46 Updated 4 months ago
- A memory profiler for NVIDIA GPUs to explore memory inefficiencies in GPU-accelerated applications. ☆25 Updated 9 months ago
- Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard! ☆62 Updated last week
- A tool for generating information about the matrix multiplication instructions in AMD Radeon™ and AMD Instinct™ accelerators ☆106 Updated last month
- Training material for Nsight developer tools ☆160 Updated 11 months ago
- Examples and exercises from the book Programming Massively Parallel Processors - A Hands-on Approach. David B. Kirk and Wen-mei W. Hwu (T… ☆71 Updated 4 years ago
- rocWMMA ☆118 Updated this week
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core) ☆137 Updated 4 years ago