rbga / CUDA-Merge-and-Bitonic-Sort
Efficient implementations of Merge Sort and Bitonic Sort algorithms using CUDA for GPU parallel processing, resulting in accelerated sorting of large arrays. Includes both CPU and GPU versions, along with a performance comparison.
☆17 Updated 2 years ago
Alternatives and similar repositories for CUDA-Merge-and-Bitonic-Sort
Users who are interested in CUDA-Merge-and-Bitonic-Sort are comparing it to the repositories listed below
- Personal Notes for Learning HPC & Parallel Computation [Active Adding New Content] ☆73 Updated 3 years ago
- CUDA Matrix Multiplication Optimization ☆222 Updated last year
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core) ☆143 Updated 5 years ago
- Examples from Programming in Parallel with CUDA ☆161 Updated 2 years ago
- ☆257 Updated last week
- 📚 A curated list of awesome matrix-matrix multiplication (A * B = C) frameworks, libraries and software ☆54 Updated 7 months ago
- CUDA by Example, written by two senior members of the CUDA software platform team, shows programmers how to employ this new technology. … ☆441 Updated 2 years ago
- Stepwise optimizations of DGEMM on CPU, reaching performance faster than Intel MKL eventually, even under multithreading. ☆153 Updated 3 years ago
- An extension library of WMMA API (Tensor Core API) ☆105 Updated last year
- Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance. ☆382 Updated 8 months ago
- Dissecting NVIDIA GPU Architecture ☆105 Updated 3 years ago
- ☆17 Updated last year
- Instructions, Docker images, and examples for Nsight Compute and Nsight Systems ☆132 Updated 5 years ago
- Isolating mlir tutorial dialect implementation ☆25 Updated last month
- ☆114 Updated last year
- ☆153 Updated 9 months ago
- CUTLASS and CuTe Examples ☆84 Updated this week
- Xiao's CUDA Optimization Guide [NO LONGER ADDING NEW CONTENT] ☆314 Updated 2 years ago
- Machine Learning Compiler Road Map ☆44 Updated 2 years ago
- Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial ☆299 Updated 3 weeks ago
- Several optimization methods of half-precision general matrix vector multiplication (HGEMV) using CUDA core. ☆65 Updated last year
- Optimize GEMM with tensorcore step by step ☆32 Updated last year
- Implementation and analysis of five different GPU based SPMV algorithms in CUDA ☆40 Updated 6 years ago
- A simple high performance CUDA GEMM implementation. ☆406 Updated last year
- MLIR Sample dialect ☆129 Updated 7 months ago
- ☆108 Updated last year
- Training material for Nsight developer tools ☆167 Updated last year
- CUDA PTX-ISA Document (Chinese translation) ☆44 Updated 3 months ago
- collection of benchmarks to measure basic GPU capabilities ☆419 Updated 7 months ago
- We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel … ☆186 Updated 7 months ago