rbga / CUDA-Merge-and-Bitonic-Sort
Efficient implementations of Merge Sort and Bitonic Sort algorithms using CUDA for GPU parallel processing, resulting in accelerated sorting of large arrays. Includes both CPU and GPU versions, along with a performance comparison.
☆11 Updated last year
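As a rough illustration of the bitonic sort named in the description, here is a minimal CPU-side sketch in C++. It is not taken from the repository: the function name `bitonic_sort` and the power-of-two length requirement are assumptions made for this example. The inner loop over `i` is the part a CUDA kernel would typically spread across threads, one element index per thread.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not the repository's code): sorts `data` ascending
// in place using a bitonic sorting network.
// Assumption: data.size() is a power of two; pad the input otherwise.
void bitonic_sort(std::vector<int>& data) {
    const std::size_t n = data.size();
    // k is the length of the bitonic sequences merged in this stage.
    for (std::size_t k = 2; k <= n; k *= 2) {
        // j is the distance between the two elements of a compare-exchange.
        for (std::size_t j = k / 2; j > 0; j /= 2) {
            // Every i is independent within one (k, j) step, which is why
            // a GPU version can assign one thread per index here.
            for (std::size_t i = 0; i < n; ++i) {
                const std::size_t partner = i ^ j;
                if (partner > i) {
                    // Blocks of length k alternate between ascending and
                    // descending order; the final stage (k == n) is ascending.
                    const bool ascending = (i & k) == 0;
                    const bool out_of_order = ascending
                        ? data[i] > data[partner]
                        : data[i] < data[partner];
                    if (out_of_order) {
                        std::swap(data[i], data[partner]);
                    }
                }
            }
        }
    }
}
```

The network performs the same sequence of comparisons regardless of the input values, which is the property that makes it a good fit for data-parallel hardware.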
Related projects
Alternatives and complementary repositories for CUDA-Merge-and-Bitonic-Sort
- ☆64 Updated last month
- BGHT: High-performance static GPU hash tables. ☆55 Updated 2 months ago
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core) ☆116 Updated 4 years ago
- An extension library of WMMA API (Tensor Core API) ☆84 Updated 4 months ago
- Optimize GEMM with tensorcore step by step ☆15 Updated 11 months ago
- Solutions to Programming Massively Parallel Processors, 2nd edition ☆27 Updated 2 years ago
- My notes on various HPC papers. ☆21 Updated last year
- A simple profiler to count Nvidia PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis. ☆43 Updated 10 months ago
- Repository holding the code base to AC-SpGEMM : "Adaptive Sparse Matrix-Matrix Multiplication on the GPU" ☆28 Updated 4 years ago
- Personal Notes for Learning HPC & Parallel Computation [Active Adding New Content] ☆59 Updated 2 years ago
- SNIG: Accelerated Large Sparse Neural Network Inference using Task Graph Parallelism ☆34 Updated 3 years ago
- IMPACT GPU Algorithms Teaching Labs ☆55 Updated last year
- Evaluating different memory managers for dynamic GPU memory ☆24 Updated 3 years ago
- ☆37 Updated 3 years ago
- Implementation of TSM2L and TSM2R -- High-Performance Tall-and-Skinny Matrix-Matrix Multiplication Algorithms for CUDA ☆31 Updated 4 years ago
- TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstraction for processing tiles. ☆156 Updated this week
- 🎃 GPU load-balancing library for regular and irregular computations. ☆57 Updated 5 months ago
- Chinese translation of the CUDA PTX-ISA document ☆26 Updated 8 months ago
- ☆10 Updated 4 years ago
- NUMA-aware multi-CPU multi-GPU data transfer benchmarks ☆21 Updated last year
- ☆64 Updated 10 years ago
- GPU Performance Advisor ☆63 Updated 2 years ago
- The repository targets the OpenCL gemm function performance optimization. It compares several libraries clBLAS, clBLAST, MIOpenGemm, Inte… ☆16 Updated 5 years ago
- TLB Benchmarks ☆32 Updated 7 years ago
- ☆37 Updated this week
- Code samples for the CUDA tutorial "CUDA and Applications to Task-based Programming" ☆82 Updated last year
- ☆41 Updated 4 years ago
- Class of High Performance Computing taken at U.T.P 2017 ☆34 Updated 7 years ago
- Sample examples of how to call collective operation functions on multi-GPU environments. A simple example of using broadcast, reduce, all… ☆25 Updated last year
- OpenCL Tutorials ☆47 Updated 4 years ago