rbga / CUDA-Merge-and-Bitonic-Sort
Efficient implementations of Merge Sort and Bitonic Sort algorithms using CUDA for GPU parallel processing, resulting in accelerated sorting of large arrays. Includes both CPU and GPU versions, along with a performance comparison.
☆20 Updated 2 years ago
Alternatives and similar repositories for CUDA-Merge-and-Bitonic-Sort
Users who are interested in CUDA-Merge-and-Bitonic-Sort are comparing it to the repositories listed below.
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core) ☆146 Updated 5 years ago
- CUDA Matrix Multiplication Optimization ☆247 Updated last year
- Personal Notes for Learning HPC & Parallel Computation [Active Adding New Content] ☆75 Updated 3 years ago
- An extension library of WMMA API (Tensor Core API) ☆109 Updated last year
- CUTLASS and CuTe Examples ☆114 Updated 3 weeks ago
- Examples from Programming in Parallel with CUDA ☆169 Updated 2 years ago
- Examples and exercises from the book Programming Massively Parallel Processors - A Hands-on Approach. David B. Kirk and Wen-mei W. Hwu (T… ☆76 Updated 4 years ago
- 📚 A curated list of awesome matrix-matrix multiplication (A * B = C) frameworks, libraries and software ☆59 Updated 10 months ago
- ☆71 Updated 11 years ago
- Stepwise optimizations of DGEMM on CPU, eventually outperforming Intel MKL, even under multithreading. ☆158 Updated 3 years ago
- Step-by-step optimization of CUDA SGEMM ☆416 Updated 3 years ago
- ☆156 Updated last year
- Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance. ☆397 Updated 11 months ago
- CUDA by Example, written by two senior members of the CUDA software platform team, shows programmers how to employ this new technology. … ☆462 Updated 2 years ago
- We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel … ☆190 Updated 10 months ago
- Optimize GEMM with tensorcore step by step ☆36 Updated 2 years ago
- Dissecting NVIDIA GPU Architecture ☆115 Updated 3 years ago
- A simple high performance CUDA GEMM implementation. ☆421 Updated last year
- Implementation and analysis of five different GPU based SPMV algorithms in CUDA ☆40 Updated 6 years ago
- Xiao's CUDA Optimization Guide [NO LONGER ADDING NEW CONTENT] ☆323 Updated 3 years ago
- CUDA PTX-ISA Document, Chinese translation ☆47 Updated 2 months ago
- Examples of CUDA implementations by Cutlass CuTe ☆263 Updated 5 months ago
- ☆110 Updated last year
- Assembler for NVIDIA Volta and Turing GPUs ☆235 Updated 3 years ago
- ☆116 Updated last year
- IMPACT GPU Algorithms Teaching Labs ☆58 Updated 2 years ago
- A tool for generating information about the matrix multiplication instructions in AMD Radeon™ and AMD Instinct™ accelerators ☆123 Updated last month
- ☆17 Updated last year
- Use tensor core to calculate back-to-back HGEMM (half-precision general matrix multiplication) with MMA PTX instruction. ☆13 Updated 2 years ago
- Code base and slides for ECE408:Applied Parallel Programming On GPU. ☆141 Updated 4 years ago