ShadyBoukhary / GPU-research-FFT-OpenACC-CUDA
Case studies constitute a modern interdisciplinary and valuable teaching practice which plays a critical and fundamental role in the development of new skills and the formation of new knowledge. This research studies the behavior and performance of two interdisciplinary and widely adopted scientific kernels, a Fast Fourier Transform and Matrix M…
☆13Updated 7 years ago
Alternatives and similar repositories for GPU-research-FFT-OpenACC-CUDA
Users who are interested in GPU-research-FFT-OpenACC-CUDA are comparing it to the libraries listed below
- My notes on various HPC papers.☆22Updated 2 years ago
- TiledLower is a Dataflow Analysis and Codegen Framework written in Rust.☆14Updated 9 months ago
- ☆14Updated 6 years ago
- TiledKernel is a code generation library based on macro kernels and memory hierarchy graph data structure.☆19Updated last year
- Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs.☆32Updated 5 months ago
- ☆18Updated 3 years ago
- GPU Performance Advisor☆66Updated 3 years ago
- ☆16Updated 2 years ago
- An easily extensible framework for understanding and optimizing CUDA operators, intended for learning purposes only.☆16Updated last year
- An MLIR-based toy DL compiler for TVM Relay.☆59Updated 2 years ago
- ☆14Updated 3 years ago
- ☆27Updated 7 months ago
- Tacker: Tensor-CUDA Core Kernel Fusion for Improving the GPU Utilization while Ensuring QoS☆31Updated 7 months ago
- An extension library of WMMA API (Tensor Core API)☆104Updated last year
- Multi-GPU communication profiler and visualizer☆32Updated last year
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.☆97Updated 2 months ago
- A memory profiler for NVIDIA GPUs to explore memory inefficiencies in GPU-accelerated applications.☆25Updated 11 months ago
- NUMA-aware multi-CPU multi-GPU data transfer benchmarks☆24Updated last year
- A GPU-accelerated DNN inference serving system that supports instant kernel preemption and biased concurrent execution in GPU scheduling.☆43Updated 3 years ago
- Official repository for "IPDPS'24 QSync: Quantization-Minimized Synchronous Distributed Training Across Hybrid Devices".☆20Updated last year
- Code for paper "Engineering a High-Performance GPU B-Tree" accepted to PPoPP 2019☆57Updated 3 years ago
- DISB is a new DNN inference serving benchmark with diverse workloads and models, as well as real-world traces.☆53Updated last year
- Supplemental materials for The ASPLOS 2025 / EuroSys 2025 Contest on Intra-Operator Parallelism for Distributed Deep Learning☆22Updated 4 months ago
- A GPU FP32 computation method with Tensor Cores.☆21Updated 2 years ago
- PTX-EMU is a simple emulator for CUDA programs.☆34Updated 4 months ago
- Emulating DMA Engines on GPUs for Performance and Portability☆41Updated 10 years ago
- IMPACT GPU Algorithms Teaching Labs☆58Updated 2 years ago
- ☆31Updated 3 years ago
- We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …☆185Updated 7 months ago
- An Attention Superoptimizer☆22Updated 7 months ago