ShadyBoukhary / GPU-research-FFT-OpenACC-CUDA
Case studies constitute a modern interdisciplinary and valuable teaching practice which plays a critical and fundamental role in the development of new skills and the formation of new knowledge. This research studies the behavior and performance of two interdisciplinary and widely adopted scientific kernels, a Fast Fourier Transform and Matrix M…
☆13 · Updated 6 years ago
Alternatives and similar repositories for GPU-research-FFT-OpenACC-CUDA
Users who are interested in GPU-research-FFT-OpenACC-CUDA are comparing it to the repositories listed below:
- Emulating DMA Engines on GPUs for Performance and Portability ☆39 · Updated 9 years ago
- Sample examples of how to call collective operation functions on multi-GPU environments. A simple example of using broadcast, reduce, all… ☆32 · Updated last year
- Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs. ☆32 · Updated 3 weeks ago
- Implementation of TSM2L and TSM2R -- High-Performance Tall-and-Skinny Matrix-Matrix Multiplication Algorithms for CUDA ☆32 · Updated 4 years ago
- TiledKernel is a code generation library based on macro kernels and memory hierarchy graph data structure. ☆19 · Updated 11 months ago
- NUMA-aware multi-CPU multi-GPU data transfer benchmarks ☆23 · Updated last year
- A memory profiler for NVIDIA GPUs to explore memory inefficiencies in GPU-accelerated applications. ☆25 · Updated 6 months ago
- An extension library of WMMA API (Tensor Core API) ☆96 · Updated 9 months ago
- My notes on various HPC papers. ☆22 · Updated 2 years ago
- An easily extensible framework for understanding and optimizing CUDA operators, intended for learning purposes only ☆13 · Updated 10 months ago
- Performance Prediction Toolkit ☆51 · Updated 4 months ago
- Multi-GPU communication profiler and visualizer ☆28 · Updated 10 months ago
- GPU Performance Advisor ☆64 · Updated 2 years ago
- Code for paper "Engineering a High-Performance GPU B-Tree" accepted to PPoPP 2019 ☆55 · Updated 2 years ago
- rocSHMEM intra-kernel networking runtime for AMD dGPUs on the ROCm platform. ☆77 · Updated this week
- CUDA for MNIST training/inference ☆40 · Updated last year
- Efficient SpGEMM on GPU using CUDA and CSR ☆52 · Updated last year
- Third party assembler and GEMM library for NVIDIA Kepler GPU ☆81 · Updated 5 years ago
- CUDA implementation of the fundamental sum reduce operation. Aims to be as optimized as reasonable. ☆37 · Updated 7 years ago
- Machine Learning System ☆14 · Updated 4 years ago
- gossip: Efficient Communication Primitives for Multi-GPU Systems ☆59 · Updated 2 years ago
- High-performance, GPU-aware communication library ☆85 · Updated 3 months ago