ShadyBoukhary / GPU-research-FFT-OpenACC-CUDA
Case studies constitute a modern interdisciplinary and valuable teaching practice which plays a critical and fundamental role in the development of new skills and the formation of new knowledge. This research studies the behavior and performance of two interdisciplinary and widely adopted scientific kernels, a Fast Fourier Transform and Matrix M…
☆13Updated 7 years ago
Alternatives and similar repositories for GPU-research-FFT-OpenACC-CUDA
Users who are interested in GPU-research-FFT-OpenACC-CUDA are comparing it to the repositories listed below
- ☆18Updated 3 years ago
- My notes on various HPC papers.☆23Updated 2 years ago
- TiledLower is a Dataflow Analysis and Codegen Framework written in Rust.☆14Updated 11 months ago
- Slides and exercises for persistent memory programming tutorial☆14Updated 2 years ago
- ☆14Updated 6 years ago
- TiledKernel is a code generation library based on macro kernels and memory hierarchy graph data structure.☆19Updated last year
- Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs.☆32Updated 6 months ago
- GPU Performance Advisor☆65Updated 3 years ago
- An easily extensible framework for understanding and optimizing CUDA operators, intended for learning purposes only☆18Updated last year
- gossip: Efficient Communication Primitives for Multi-GPU Systems☆59Updated 3 years ago
- A GPU FP32 computation method with Tensor Cores.☆21Updated 2 years ago
- An MLIR-based toy DL compiler for TVM Relay.☆59Updated 3 years ago
- A memory profiler for NVIDIA GPUs to explore memory inefficiencies in GPU-accelerated applications.☆26Updated last year
- Official page for 18-847C (Spring '22): Data Center Computing☆15Updated 3 years ago
- An MLIR-based AI compiler designed for Python frontend to RISC-V DSA☆12Updated last year
- ☆27Updated 8 months ago
- High performance NCCL plugin for Bagua.☆15Updated 4 years ago
- Tacker: Tensor-CUDA Core Kernel Fusion for Improving the GPU Utilization while Ensuring QoS☆31Updated 8 months ago
- Simple PyTorch profiler that combines DeepSpeed Flops Profiler and TorchInfo☆11Updated 2 years ago
- ☆14Updated 2 weeks ago
- High performance RDMA-based distributed feature collection component for training GNN model on EXTREMELY large graph☆55Updated 3 years ago
- A fast and accurate reuse distance analyzer for multi-threaded applications. It leverages existing hardware features in commodity CPUs.☆20Updated 2 years ago
- Handwritten GEMM using Intel AMX (Advanced Matrix Extension)☆16Updated 9 months ago
- ☆16Updated 2 years ago
- Supplemental materials for The ASPLOS 2025 / EuroSys 2025 Contest on Intra-Operator Parallelism for Distributed Deep Learning☆23Updated 5 months ago
- Tutorials for NVIDIA CUPTI samples☆36Updated 2 months ago
- NUMA-aware multi-CPU multi-GPU data transfer benchmarks☆25Updated 2 years ago
- ☆12Updated 6 months ago
- STREAMer: Benchmarking remote volatile and non-volatile memory bandwidth☆17Updated 2 years ago
- IMPACT GPU Algorithms Teaching Labs☆58Updated 2 years ago