rbga / CUDA-Merge-and-Bitonic-Sort
Efficient implementations of Merge Sort and Bitonic Sort algorithms using CUDA for GPU parallel processing, resulting in accelerated sorting of large arrays. Includes both CPU and GPU versions, along with a performance comparison.
☆21Updated 2 years ago
Alternatives and similar repositories for CUDA-Merge-and-Bitonic-Sort
Users who are interested in CUDA-Merge-and-Bitonic-Sort are comparing it to the repositories listed below
- Personal Notes for Learning HPC & Parallel Computation [NO LONGER ADDING NEW CONTENT]☆76Updated 3 years ago
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core)☆146Updated 5 years ago
- CUDA Matrix Multiplication Optimization☆256Updated last year
- An extension library of WMMA API (Tensor Core API)☆109Updated last year
- CUTLASS and CuTe Examples☆117Updated 2 months ago
- Examples from Programming in Parallel with CUDA☆170Updated 2 years ago
- Training material for Nsight developer tools☆178Updated last year
- Stepwise optimizations of DGEMM on CPU, eventually outperforming Intel MKL, even under multithreading.☆163Updated 3 years ago
- 📚 A curated list of awesome matrix-matrix multiplication (A * B = C) frameworks, libraries and software☆60Updated 11 months ago
- 🎃 GPU load-balancing library for regular and irregular computations.☆66Updated 4 months ago
- ☆120Updated last year
- ☆18Updated last year
- Examples and exercises from the book Programming Massively Parallel Processors - A Hands-on Approach. David B. Kirk and Wen-mei W. Hwu (T…☆77Updated 5 years ago
- Instructions, Docker images, and examples for Nsight Compute and Nsight Systems☆134Updated 5 years ago
- Implementation and analysis of five different GPU based SPMV algorithms in CUDA☆40Updated 6 years ago
- Several optimization methods of half-precision general matrix vector multiplication (HGEMV) using CUDA core.☆71Updated last year
- CUDA by Example, written by two senior members of the CUDA software platform team, shows programmers how to employ this new technology. …☆469Updated 2 years ago
- Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial☆348Updated last month
- Some source code about matrix multiplication implementation on CUDA☆34Updated 7 years ago
- A simple high performance CUDA GEMM implementation.☆426Updated 2 years ago
- Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance.☆403Updated last year
- Xiao's CUDA Optimization Guide [NO LONGER ADDING NEW CONTENT]☆322Updated 3 years ago
- ☆112Updated 8 months ago
- ☆111Updated last year
- Use tensor core to calculate back-to-back HGEMM (half-precision general matrix multiplication) with MMA PTX instruction.☆13Updated 2 years ago
- IMPACT GPU Algorithms Teaching Labs☆59Updated 2 years ago
- ☆70Updated 11 years ago
- Gallatin is a general-purpose memory manager for CUDA that allows for threads to quickly malloc and free memory of arbitrary size inside …☆25Updated 4 months ago
- Machine Learning Compiler Road Map☆46Updated 2 years ago
- We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …☆192Updated last year