rbga / CUDA-Merge-and-Bitonic-Sort
Efficient implementations of Merge Sort and Bitonic Sort algorithms using CUDA for GPU parallel processing, resulting in accelerated sorting of large arrays. Includes both CPU and GPU versions, along with a performance comparison.
☆12 Updated last year
Alternatives and similar repositories for CUDA-Merge-and-Bitonic-Sort:
Users who are interested in CUDA-Merge-and-Bitonic-Sort are comparing it to the libraries listed below:
- Code for paper "Engineering a High-Performance GPU B-Tree" accepted to PPoPP 2019 ☆54 Updated 2 years ago
- SNIG: Accelerated Large Sparse Neural Network Inference using Task Graph Parallelism ☆34 Updated 3 years ago
- ☆65 Updated 3 months ago
- ☆40 Updated last week
- An extension library of WMMA API (Tensor Core API) ☆87 Updated 6 months ago
- A simple profiler to count NVIDIA PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis. ☆49 Updated last year
- BGHT: High-performance static GPU hash tables. ☆57 Updated 4 months ago
- IMPACT GPU Algorithms Teaching Labs ☆56 Updated last year
- Class of High Performance Computing taken at U.T.P 2017 ☆41 Updated 7 years ago
- ☆14 Updated 9 months ago
- GPU B-Tree with support for versioning (snapshots). ☆46 Updated 3 months ago
- A memory profiler for NVIDIA GPUs to explore memory inefficiencies in GPU-accelerated applications. ☆23 Updated 3 months ago
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core) ☆123 Updated 4 years ago
- IREE's PyTorch Frontend, based on Torch Dynamo. ☆62 Updated this week
- ☆70 Updated last year
- TPP experimentation on MLIR for linear algebra ☆115 Updated this week
- TiledKernel is a code generation library based on macro kernels and memory hierarchy graph data structure. ☆19 Updated 8 months ago
- 🎃 GPU load-balancing library for regular and irregular computations. ☆59 Updated 7 months ago
- Implement Neural Networks in CUDA from Scratch ☆22 Updated 8 months ago
- A layered, decoupled deep learning inference engine ☆70 Updated last month
- MLIR-based toolkit targeting Intel heterogeneous hardware ☆37 Updated this week
- My notes on various HPC papers. ☆21 Updated 2 years ago
- This package includes the implementation for four sparse linear algebra kernels: Sparse-Matrix-Vector-Multiplication (SpMV), Sparse-Trian… ☆26 Updated 4 years ago
- Fast Matrix Multiplication Implementation in the C programming language. This matrix multiplication algorithm is similar to what Numpy uses t… ☆26 Updated 3 years ago
- Personal Notes for Learning HPC & Parallel Computation [Actively Adding New Content] ☆61 Updated 2 years ago
- Examples from Programming in Parallel with CUDA ☆117 Updated last year
- CUDA Matrix Multiplication Optimization ☆155 Updated 6 months ago
- LLVM/MLIR based compiler instrumentation of AMD GPU kernels ☆17 Updated 2 weeks ago
- AMD lab notes with code examples to demonstrate use of AMD GPUs ☆94 Updated 7 months ago
- A framework that supports executing unmodified CUDA source code on non-NVIDIA devices. ☆112 Updated 3 weeks ago