rbga / CUDA-Merge-and-Bitonic-Sort

Efficient implementations of Merge Sort and Bitonic Sort algorithms using CUDA for GPU parallel processing, resulting in accelerated sorting of large arrays. Includes both CPU and GPU versions, along with a performance comparison.

☆16

Alternatives and similar repositories for CUDA-Merge-and-Bitonic-Sort

Users who are interested in CUDA-Merge-and-Bitonic-Sort are comparing it to the repositories listed below

AyakaGEMM / Hands-on-MLIR
☆17 Updated last year
mabdullahsoyturk / HPC-Paper-Notes
My notes on various HPC papers.
☆22 Updated 2 years ago
intel / graph-compiler
MLIR-based toolkit targeting Intel heterogeneous hardware
☆45 Updated 4 months ago
Oneflow-Inc / dfccl
☆26 Updated 4 months ago
ProjectPhysX / PTXprofiler
A simple profiler to count Nvidia PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis.
☆55 Updated 3 months ago
owensgroup / GpuBTree
Code for paper "Engineering a High-Performance GPU B-Tree" accepted to PPoPP 2019
☆57 Updated 3 years ago
leimao / CUDA-GEMM-Optimization
CUDA Matrix Multiplication Optimization
☆201 Updated 11 months ago
owensgroup / MVGpuBTree
GPU B-Tree with support for versioning (snapshots).
☆49 Updated 8 months ago
amd / amd-lab-notes
AMD lab notes with code examples to demonstrate use of AMD GPUs
☆98 Updated last year
illinois-impact / gpu-algorithms-labs
IMPACT GPU Algorithms Teaching Labs
☆58 Updated 2 years ago
wmmae / wmma_extension
An extension library of WMMA API (Tensor Core API)
☆99 Updated last year
yzhaiustc / Optimizing-DGEMM-on-Intel-CPUs-with-AVX512F
Stepwise optimizations of DGEMM on CPU, reaching performance faster than Intel MKL eventually, even under multithreading.
☆151 Updated 3 years ago
XiaoSong9905 / HPC-Notes
Personal Notes for Learning HPC & Parallel Computation [Active Adding New Content]
☆68 Updated 2 years ago
leimao / CUTLASS-Examples
CUTLASS and CuTe Examples
☆60 Updated 6 months ago
c3sr / comm_scope
NUMA-aware multi-CPU multi-GPU data transfer benchmarks
☆23 Updated last year
ysh329 / OpenMP-101
Learn OpenMP examples step by step
☆95 Updated 5 months ago
RichardAns / CUDA-Programs
Examples from Programming in Parallel with CUDA
☆157 Updated 2 years ago
seb-v / fp32_sgemm_amd
Super fast FP32 matrix multiplication on RDNA3
☆68 Updated 3 months ago
FZJ-JSC / tutorial-multi-gpu
Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial
☆277 Updated last month
microsoft / TileFusion
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
☆90 Updated 2 weeks ago
gunrock / loops
🎃 GPU load-balancing library for regular and irregular computations.
☆62 Updated last year
CRobeck / instrument-amdgpu-kernels
LLVM/MLIR based compiler instrumentation of AMD GPU kernels
☆18 Updated 2 months ago
yuninxia / awesome-gemm
📚 A curated list of awesome matrix-matrix multiplication (A * B = C) frameworks, libraries and software
☆46 Updated 4 months ago
Lin-Mao / DrGPUM
A memory profiler for NVIDIA GPUs to explore memory inefficiencies in GPU-accelerated applications.
☆25 Updated 9 months ago
gpu-mode / reference-kernels
Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard!
☆62 Updated last week
ROCm / amd_matrix_instruction_calculator
A tool for generating information about the matrix multiplication instructions in AMD Radeon™ and AMD Instinct™ accelerators
☆106 Updated last month
NVIDIA / nsight-training
Training material for Nsight developer tools
☆160 Updated 11 months ago
nvixnu / pmpp__programming_massively_parallel_processors
Examples and exercises from the book Programming Massively Parallel Processors - A Hands-on Approach. David B. Kirk and Wen-mei W. Hwu (T…
☆71 Updated 4 years ago
ROCm / rocWMMA
rocWMMA
☆118 Updated this week
wzsh / wmma_tensorcore_sample
Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core)
☆137 Updated 4 years ago