rbga / CUDA-Merge-and-Bitonic-Sort
Efficient implementations of Merge Sort and Bitonic Sort algorithms using CUDA for GPU parallel processing, resulting in accelerated sorting of large arrays. Includes both CPU and GPU versions, along with a performance comparison.
☆13 · Updated last year
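For readers unfamiliar with bitonic sort, the sketch below shows the sorting network as a plain C++ CPU reference. It is not taken from the repository: the function name `bitonic_sort` and the power-of-two length requirement are assumptions made for illustration. The innermost loop has independent iterations, which is the property that lets a GPU version assign one thread per element.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for a bitonic sorting network (ascending order).
// Assumption: data.size() is a power of two; pad the input otherwise.
void bitonic_sort(std::vector<int>& data) {
    const std::size_t n = data.size();
    // k: length of the bitonic sequences being merged in this stage.
    for (std::size_t k = 2; k <= n; k <<= 1) {
        // j: distance between the two elements of each compare-exchange.
        for (std::size_t j = k >> 1; j > 0; j >>= 1) {
            // Iterations over i are independent; on a GPU each i
            // would typically be handled by its own thread.
            for (std::size_t i = 0; i < n; ++i) {
                const std::size_t partner = i ^ j;
                if (partner > i) {
                    // Blocks alternate between ascending and descending order.
                    const bool ascending = (i & k) == 0;
                    if ((data[i] > data[partner]) == ascending) {
                        std::swap(data[i], data[partner]);
                    }
                }
            }
        }
    }
}
```

The network performs the same sequence of compare-exchange steps regardless of the input values, which suits GPU execution, at the cost of O(n log² n) comparisons versus O(n log n) for merge sort.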
Alternatives and similar repositories for CUDA-Merge-and-Bitonic-Sort:
Users who are interested in CUDA-Merge-and-Bitonic-Sort are comparing it to the repositories listed below.
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. ☆74 · Updated this week
- Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores. ☆58 · Updated 6 months ago
- An extension library for the WMMA API (Tensor Core API) ☆93 · Updated 8 months ago
- ☆17 · Updated 10 months ago
- IMPACT GPU Algorithms Teaching Labs ☆57 · Updated last year
- ☆15 · Updated 5 years ago
- Machine Learning Compiler Road Map ☆43 · Updated last year
- ☆70 · Updated 2 years ago
- Optimize GEMM with Tensor Cores step by step ☆24 · Updated last year
- We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel … ☆179 · Updated 2 months ago
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core) ☆127 · Updated 4 years ago
- A language and compiler for irregular tensor programs. ☆138 · Updated 4 months ago
- Training material for Nsight developer tools ☆152 · Updated 7 months ago
- Examples and exercises from the book Programming Massively Parallel Processors - A Hands-on Approach. David B. Kirk and Wen-mei W. Hwu (T… ☆66 · Updated 4 years ago
- My notes on various HPC papers. ☆22 · Updated 2 years ago
- CUDA Matrix Multiplication Optimization ☆177 · Updated 8 months ago
- Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial ☆251 · Updated last week
- 🎃 GPU load-balancing library for regular and irregular computations. ☆62 · Updated 9 months ago
- Standalone Flash Attention v2 kernel without libtorch dependency ☆108 · Updated 6 months ago
- This project is about optimizing the convolution operator on GPU, including GEMM-based (implicit GEMM) convolution. ☆28 · Updated 3 months ago
- Examples of CUDA implementations using CUTLASS CuTe ☆148 · Updated last month
- Stepwise optimizations of DGEMM on CPU, eventually outperforming Intel MKL, even under multithreading. ☆137 · Updated 3 years ago
- ☆61 · Updated 3 months ago
- TPP experimentation on MLIR for linear algebra ☆121 · Updated 2 weeks ago
- High Performance Computing class taken at U.T.P in 2017 ☆53 · Updated 7 years ago
- IREE's PyTorch Frontend, based on Torch Dynamo. ☆74 · Updated this week
- A GPU benchmark suite for assessing on-chip GPU memory bandwidth ☆105 · Updated 7 years ago
- Implement Neural Networks in CUDA from Scratch ☆22 · Updated 10 months ago
- LLVM/MLIR-based compiler instrumentation of AMD GPU kernels ☆18 · Updated last week
- ☆90 · Updated 3 weeks ago