rbga / CUDA-Merge-and-Bitonic-Sort
Efficient CUDA implementations of the Merge Sort and Bitonic Sort algorithms, using GPU parallelism to accelerate sorting of large arrays. Includes both CPU and GPU versions, along with a performance comparison.
☆21 · Updated 2 years ago
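For readers unfamiliar with the bitonic sorting network that this repository parallelizes, the minimal CPU sketch below may help. It is not taken from the repository. It assumes the input length is a power of two, and the function name `bitonic_sort` is hypothetical. The body of the innermost loop is the per-element compare-exchange that a CUDA version would typically assign to one thread, with a synchronization point (or a separate kernel launch) between consecutive `(k, j)` steps.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: sorts `a` in ascending order with a bitonic sorting network.
// Assumption: a.size() is a power of two (checked by assert in debug builds).
void bitonic_sort(std::vector<int>& a) {
    const std::size_t n = a.size();
    assert((n & (n - 1)) == 0 && "size must be a power of two");
    // k: length of the bitonic subsequences being merged in this stage.
    for (std::size_t k = 2; k <= n; k <<= 1) {
        // j: distance between the two elements compared in this step.
        for (std::size_t j = k >> 1; j > 0; j >>= 1) {
            // Iterations over i are independent; on a GPU, one thread per i.
            for (std::size_t i = 0; i < n; ++i) {
                const std::size_t partner = i ^ j;
                if (partner <= i) continue;  // visit each pair only once
                // Blocks whose index has bit k clear sort ascending, the rest descending.
                const bool ascending = (i & k) == 0;
                if ((ascending && a[i] > a[partner]) ||
                    (!ascending && a[i] < a[partner])) {
                    std::swap(a[i], a[partner]);
                }
            }
        }
    }
}
```

Because every compare-exchange within one `(k, j)` step touches a disjoint pair of elements, the order in which `i` is visited does not matter. This is what makes the network straightforward to map onto GPU threads.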
Alternatives and similar repositories for CUDA-Merge-and-Bitonic-Sort
Users who are interested in CUDA-Merge-and-Bitonic-Sort are comparing it to the libraries listed below.
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core) ☆145 · Updated 5 years ago
- CUTLASS and CuTe Examples ☆127 · Updated 2 months ago
- Personal Notes for Learning HPC & Parallel Computation [NO LONGER ADDING NEW CONTENT] ☆77 · Updated 3 years ago
- An extension library of WMMA API (Tensor Core API) ☆109 · Updated last year
- CUDA Matrix Multiplication Optimization ☆256 · Updated last year
- Training material for Nsight developer tools ☆177 · Updated last year
- CUDA by Example, written by two senior members of the CUDA software platform team, shows programmers how to employ this new technology. … ☆473 · Updated 2 years ago
- ☆18 · Updated last year
- Use tensor core to calculate back-to-back HGEMM (half-precision general matrix multiplication) with MMA PTX instruction. ☆13 · Updated 2 years ago
- ☆112 · Updated last year
- ☆70 · Updated 11 years ago
- Examples from Programming in Parallel with CUDA ☆170 · Updated this week
- Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial ☆350 · Updated 2 months ago
- Assembler for NVIDIA Volta and Turing GPUs ☆238 · Updated 4 years ago
- My notes on various HPC papers. ☆25 · Updated 3 years ago
- IMPACT GPU Algorithms Teaching Labs ☆59 · Updated 2 years ago
- Instructions, Docker images, and examples for Nsight Compute and Nsight Systems ☆135 · Updated 5 years ago
- Dissecting NVIDIA GPU Architecture ☆116 · Updated 3 years ago
- Several optimization methods of half-precision general matrix vector multiplication (HGEMV) using CUDA core. ☆72 · Updated last year
- We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel … ☆192 · Updated last year
- Implementation and analysis of five different GPU-based SpMV algorithms in CUDA ☆40 · Updated 7 years ago
- A tool for generating information about the matrix multiplication instructions in AMD Radeon™ and AMD Instinct™ accelerators ☆125 · Updated 2 months ago
- ☆49 · Updated 5 years ago
- Optimizing SGEMM kernel functions on NVIDIA GPUs to close-to-cuBLAS performance. ☆407 · Updated last year
- ☆26 · Updated 11 months ago
- 📚 A curated list of awesome matrix-matrix multiplication (A * B = C) frameworks, libraries and software ☆60 · Updated 11 months ago
- Sample examples of how to call collective operation functions on multi-GPU environments. A simple example of using broadcast, reduce, all… ☆35 · Updated 2 years ago
- Examples and exercises from the book Programming Massively Parallel Processors - A Hands-on Approach. David B. Kirk and Wen-mei W. Hwu (T… ☆77 · Updated 5 years ago
- DGEMM on KNL, achieving 75% of MKL performance ☆19 · Updated 3 years ago
- ☆50 · Updated 6 years ago