ShadyBoukhary / GPU-research-FFT-OpenACC-CUDA
Case studies constitute a modern interdisciplinary and valuable teaching practice which plays a critical and fundamental role in the development of new skills and the formation of new knowledge. This research studies the behavior and performance of two interdisciplinary and widely adopted scientific kernels, a Fast Fourier Transform and Matrix M…
☆10Updated 6 years ago
Related projects
Alternatives and complementary repositories for GPU-research-FFT-OpenACC-CUDA
- TiledKernel is a code generation library based on macro kernels and memory hierarchy graph data structure.☆19Updated 6 months ago
- A memory profiler for NVIDIA GPUs to explore memory inefficiencies in GPU-accelerated applications.☆22Updated last month
- My notes on various HPC papers.☆21Updated last year
- Tacker: Tensor-CUDA Core Kernel Fusion for Improving the GPU Utilization while Ensuring QoS☆17Updated 2 years ago
- GPU Performance Advisor☆63Updated 2 years ago
- FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme☆46Updated 2 months ago
- ☆29Updated 2 years ago
- A source-to-source compiler for optimizing CUDA dynamic parallelism by aggregating launches☆13Updated 5 years ago
- PTX-EMU is a simple emulator for CUDA programs.☆24Updated 10 months ago
- Official repository for "IPDPS '24 QSync: Quantization-Minimized Synchronous Distributed Training Across Hybrid Devices".☆19Updated 8 months ago
- The LLVM Project is a collection of modular and reusable compiler and toolchain technologies. Note: the repository does not accept github…☆32Updated this week
- An MLIR-based toy DL compiler for TVM Relay.☆53Updated 2 years ago
- ☆48Updated 8 months ago
- ☆12Updated 2 years ago
- An Attention Superoptimizer☆20Updated 6 months ago
- Emulating DMA Engines on GPUs for Performance and Portability☆34Updated 9 years ago
- Optimize tensor programs fast with Felix, a gradient descent autotuner.☆19Updated 6 months ago
- PyTorch compilation tutorial covering TorchScript, torch.fx, and Slapo☆19Updated last year
- Performance Prediction Toolkit☆51Updated 2 years ago
- ☆40Updated 3 years ago
- ☆15Updated 5 years ago
- NUMA-aware multi-CPU multi-GPU data transfer benchmarks☆21Updated last year
- Triton to TVM transpiler.☆16Updated last month
- End to End steps for adding custom ops in PyTorch.☆19Updated 4 years ago
- [CF ’20] Verified Instruction-Level Energy Consumption Measurement for NVIDIA GPUs☆15Updated 3 years ago
- Source code for the FAST '23 paper “MadFS: Per-File Virtualization for Userspace Persistent Memory Filesystems”☆34Updated last year
- ☆13Updated last year
- Fast Fourier Transform implementation, computable on CUDA platform. Seminar project for MI-PRC course at FIT CTU.☆36Updated last year
- A GPU FP32 computation method with Tensor Cores.☆18Updated last year
- An IR for efficiently simulating distributed ML computation.☆25Updated 10 months ago