ShadyBoukhary / GPU-research-FFT-OpenACC-CUDA
Case studies constitute a modern interdisciplinary and valuable teaching practice which plays a critical and fundamental role in the development of new skills and the formation of new knowledge. This research studies the behavior and performance of two interdisciplinary and widely adopted scientific kernels, a Fast Fourier Transform and Matrix M…
☆13Updated 6 years ago
Alternatives and similar repositories for GPU-research-FFT-OpenACC-CUDA:
Users that are interested in GPU-research-FFT-OpenACC-CUDA are comparing it to the libraries listed below
- Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs.☆31Updated 3 months ago
- NUMA-aware multi-CPU multi-GPU data transfer benchmarks☆23Updated last year
- A memory profiler for NVIDIA GPUs to explore memory inefficiencies in GPU-accelerated applications.☆25Updated 5 months ago
- TiledKernel is a code generation library based on macro kernels and memory hierarchy graph data structure.☆19Updated 10 months ago
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.☆74Updated this week
- My notes on various HPC papers.☆22Updated 2 years ago
- ☆15Updated 5 years ago
- Emulating DMA Engines on GPUs for Performance and Portability☆38Updated 9 years ago
- ☆30Updated 2 years ago
- A tool for examining GPU scheduling behavior.☆74Updated 7 months ago
- High-performance, GPU-aware communication library☆85Updated 2 months ago
- ThrillerFlow is a Dataflow Analysis and Codegen Framework written in Rust.☆14Updated 4 months ago
- PyTorch compilation tutorial covering TorchScript, torch.fx, and Slapo☆18Updated 2 years ago
- My tests and experiments with some popular dl frameworks.☆12Updated last month
- GPU Performance Advisor☆64Updated 2 years ago
- Sample examples of how to call collective operation functions on multi-GPU environments. A simple example of using broadcast, reduce, all…☆32Updated last year
- Cavs: An Efficient Runtime System for Dynamic Neural Networks☆14Updated 4 years ago
- FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme☆57Updated last week
- Slides and exercises for persistent memory programming tutorial☆12Updated 2 years ago
- ☆39Updated 5 years ago
- A hierarchical collective communications library with portable optimizations☆32Updated 3 months ago
- Matrix Multiply-Accumulate with CUDA and WMMA( Tensor Core)☆127Updated 4 years ago
- A simple profiler to count Nvidia PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis.☆50Updated last week
- 方便扩展的Cuda算子理解和优化框架,仅用在学习使用☆13Updated 9 months ago
- ☆43Updated 4 years ago
- CUDA implementation of the fundamental sum reduce operation. Aims to be as optimized as reasonable.☆36Updated 7 years ago
- An MLIR-based toy DL compiler for TVM Relay.☆58Updated 2 years ago
- Implementation of TSM2L and TSM2R -- High-Performance Tall-and-Skinny Matrix-Matrix Multiplication Algorithms for CUDA☆32Updated 4 years ago
- Tacker: Tensor-CUDA Core Kernel Fusion for Improving the GPU Utilization while Ensuring QoS☆23Updated last month
- ☆49Updated last year