ShadyBoukhary / GPU-research-FFT-OpenACC-CUDA
Case studies constitute a modern interdisciplinary and valuable teaching practice which plays a critical and fundamental role in the development of new skills and the formation of new knowledge. This research studies the behavior and performance of two interdisciplinary and widely adopted scientific kernels, a Fast Fourier Transform and Matrix M…
☆12Updated 6 years ago
Alternatives and similar repositories for GPU-research-FFT-OpenACC-CUDA:
Users who are interested in GPU-research-FFT-OpenACC-CUDA are comparing it to the libraries listed below:
- A memory profiler for NVIDIA GPUs to explore memory inefficiencies in GPU-accelerated applications.☆22Updated 3 months ago
- TiledKernel is a code generation library based on macro kernels and memory hierarchy graph data structure.☆19Updated 8 months ago
- My notes on various HPC papers.☆21Updated 2 years ago
- Emulating DMA Engines on GPUs for Performance and Portability☆35Updated 9 years ago
- [CF ’20] Verified Instruction-Level Energy Consumption Measurement for NVIDIA GPUs☆15Updated 4 years ago
- Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs.☆30Updated last month
- Slides and exercises for persistent memory programming tutorial☆12Updated 2 years ago
- GPU Performance Advisor☆63Updated 2 years ago
- Fast GPU error-bounded lossy compressor for floating-point data.☆29Updated 3 weeks ago
- Prototype of OpenSHMEM for NVIDIA GPUs, developed as part of DoE Design Forward☆20Updated 6 years ago
- A GPU accelerated error-bounded lossy compression for scientific data.☆69Updated this week
- Official repository for "IPDPS '24 QSync: Quantization-Minimized Synchronous Distributed Training Across Hybrid Devices".☆19Updated 10 months ago
- NUMA-aware multi-CPU multi-GPU data transfer benchmarks☆21Updated last year
- An Attention Superoptimizer☆20Updated 8 months ago
- Code for paper "Engineering a High-Performance GPU B-Tree" accepted to PPoPP 2019☆54Updated 2 years ago
- Fast Fourier transform on GPU in shared memory for AstroAccelerate project☆26Updated 4 years ago
- Tacker: Tensor-CUDA Core Kernel Fusion for Improving the GPU Utilization while Ensuring QoS☆18Updated 3 years ago
- High-performance, GPU-aware communication library☆84Updated last week
- CUDA implementation of the fundamental sum reduce operation. Aims to be as optimized as reasonable.☆36Updated 7 years ago
- A simple profiler to count NVIDIA PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis.☆47Updated last year
- A GPU-based LZSS compression algorithm, highly tuned for NVIDIA GPGPUs and for streaming data, leveraging the respective strengths of CPU…☆36Updated 9 years ago
- Performance Prediction Toolkit☆51Updated last month
- A source-to-source compiler for optimizing CUDA dynamic parallelism by aggregating launches☆14Updated 5 years ago
- An HPL-AI implementation for Fugaku☆19Updated 3 years ago
- QCD for Intel Xeon Phi and Xeon processors☆14Updated 9 months ago
- Fast Fourier Transform implementation, computable on CUDA platform. Seminar project for MI-PRC course at FIT CTU.☆37Updated last year
- A GPU FP32 computation method with Tensor Cores.☆19Updated 2 years ago
- ☆30Updated 2 years ago
- GPULZ: Optimizing LZSS Lossless Compression for Multi-byte Data on Modern GPUs☆14Updated 10 months ago
- rocSHMEM intra-kernel networking runtime for AMD dGPUs on the ROCm platform.☆48Updated this week