ShadyBoukhary / GPU-research-FFT-OpenACC-CUDA
Case studies constitute a modern interdisciplinary and valuable teaching practice which plays a critical and fundamental role in the development of new skills and the formation of new knowledge. This research studies the behavior and performance of two interdisciplinary and widely adopted scientific kernels, a Fast Fourier Transform and Matrix M…
☆14Updated 6 years ago
Alternatives and similar repositories for GPU-research-FFT-OpenACC-CUDA
Users who are interested in GPU-research-FFT-OpenACC-CUDA are comparing it to the libraries listed below
- hardware test for CPU,GPU,I/O,memory bandwidth performance☆25Updated 6 years ago
- A memory profiler for NVIDIA GPUs to explore memory inefficiencies in GPU-accelerated applications.☆25Updated 9 months ago
- Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs.☆33Updated 3 months ago
- My notes on various HPC papers.☆22Updated 2 years ago
- ☆17Updated 3 years ago
- Slides and exercises for persistent memory programming tutorial☆13Updated 2 years ago
- TiledKernel is a code generation library based on macro kernels and memory hierarchy graph data structure.☆19Updated last year
- ☆15Updated 6 years ago
- An MLIR-based toy DL compiler for TVM Relay.☆58Updated 2 years ago
- An easily extensible framework for understanding and optimizing CUDA operators, intended for learning purposes only☆15Updated last year
- Triton to TVM transpiler.☆21Updated 8 months ago
- ☆31Updated 3 years ago
- Code for paper "Engineering a High-Performance GPU B-Tree" accepted to PPoPP 2019☆57Updated 3 years ago
- PyTorch compilation tutorial covering TorchScript, torch.fx, and Slapo☆18Updated 2 years ago
- Emulating DMA Engines on GPUs for Performance and Portability☆40Updated 10 years ago
- Finite Field Operations on GPGPU☆15Updated last year
- GPU Performance Advisor☆65Updated 2 years ago
- A simple profiler to count Nvidia PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis.☆55Updated 3 months ago
- Official repository for "IPDPS '24 QSync: Quantization-Minimized Synchronous Distributed Training Across Hybrid Devices".☆20Updated last year
- TiledLower is a Dataflow Analysis and Codegen Framework written in Rust.☆14Updated 7 months ago
- An Attention Superoptimizer☆22Updated 5 months ago
- A Top-Down Profiler for GPU Applications☆20Updated last year
- Official page for 18-847C (Spring '22): Data Center Computing☆17Updated 3 years ago
- Performance Prediction Toolkit☆52Updated 6 months ago
- Tacker: Tensor-CUDA Core Kernel Fusion for Improving the GPU Utilization while Ensuring QoS☆27Updated 5 months ago
- Mu: Microsecond Consensus for Microsecond Applications☆40Updated 4 years ago
- ☆26Updated 4 months ago
- Asynchronous semantics for architectural simulation and synthesis.☆39Updated this week
- FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme☆76Updated 3 months ago
- Source code for the FAST '23 paper “MadFS: Per-File Virtualization for Userspace Persistent Memory Filesystems”☆41Updated 2 years ago