rbga / CUDA-Merge-and-Bitonic-Sort
Efficient implementations of Merge Sort and Bitonic Sort algorithms using CUDA for GPU parallel processing, resulting in accelerated sorting of large arrays. Includes both CPU and GPU versions, along with a performance comparison.
☆16Updated 2 years ago
Alternatives and similar repositories for CUDA-Merge-and-Bitonic-Sort
Users who are interested in CUDA-Merge-and-Bitonic-Sort are comparing it to the repositories listed below
- CUDA Matrix Multiplication Optimization☆220Updated last year
- Personal Notes for Learning HPC & Parallel Computation [Active Adding New Content]☆70Updated 3 years ago
- CUDA by Example, written by two senior members of the CUDA software platform team, shows programmers how to employ this new technology. …☆436Updated 2 years ago
- An extension library of WMMA API (Tensor Core API)☆103Updated last year
- Examples from Programming in Parallel with CUDA☆160Updated 2 years ago
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core)☆139Updated 5 years ago
- Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial☆295Updated this week
- CUTLASS and CuTe Examples☆72Updated last month
- Machine Learning Compiler Road Map☆43Updated last year
- Training material for Nsight developer tools☆163Updated last year
- Optimize GEMM with Tensor Cores step by step☆32Updated last year
- Some source code about matrix multiplication implementation on CUDA☆34Updated 6 years ago
- Assembler for NVIDIA Volta and Turing GPUs☆229Updated 3 years ago
- A simple high performance CUDA GEMM implementation.☆398Updated last year
- Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance.☆374Updated 8 months ago
- Chinese translation of the CUDA PTX-ISA document☆44Updated 3 months ago
- A GPU benchmark suite for assessing on-chip GPU memory bandwidth☆106Updated 8 years ago
- Instructions, Docker images, and examples for Nsight Compute and Nsight Systems☆131Updated 5 years ago
- Dissecting NVIDIA GPU Architecture☆105Updated 3 years ago
- ☆14Updated 6 years ago
- Stepwise optimizations of DGEMM on CPU, eventually outperforming Intel MKL, even under multithreading.☆153Updated 3 years ago
- We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …☆183Updated 7 months ago
- IMPACT GPU Algorithms Teaching Labs☆58Updated 2 years ago
- Examples of how to call collective operation functions in multi-GPU environments. A simple example of using broadcast, reduce, all…☆34Updated 2 years ago
- ☆153Updated 8 months ago
- A language and compiler for irregular tensor programs.☆149Updated 9 months ago
- Several optimization methods of half-precision general matrix vector multiplication (HGEMV) using CUDA core.☆64Updated 11 months ago
- Xiao's CUDA Optimization Guide [NO LONGER ADDING NEW CONTENT]☆313Updated 2 years ago
- ☆70Updated 2 years ago
- ☆68Updated 11 years ago