roguh / cuda-fft
Yet another FFT implementation in CUDA. Includes benchmarks using simple data for comparing different implementations.
☆12Updated 4 years ago
Alternatives and similar repositories for cuda-fft
Users who are interested in cuda-fft are comparing it to the libraries listed below.
- Case studies constitute a modern interdisciplinary and valuable teaching practice which plays a critical and fundamental role in the deve…☆13Updated 7 years ago
- Fast Fourier Transform Acceleration Algorithm. (Accelerated by CUDA)☆11Updated 7 years ago
- Examples from Programming in Parallel with CUDA☆170Updated last week
- Xiao's CUDA Optimization Guide [NO LONGER ADDING NEW CONTENT]☆322Updated 3 years ago
- BLISlab: A Sandbox for Optimizing GEMM☆555Updated 4 years ago
- Step-by-step optimization of CUDA SGEMM☆428Updated 3 years ago
- A simple high performance CUDA GEMM implementation.☆426Updated 2 years ago
- Stepwise optimizations of DGEMM on CPU, reaching performance faster than Intel MKL eventually, even under multithreading.☆163Updated 4 years ago
- Distributed Communication-Optimal Matrix-Matrix Multiplication Algorithm☆212Updated last week
- 2018 Parallel Computing course repo☆33Updated 4 years ago
- ☆290Updated 5 years ago
- Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial☆350Updated 2 months ago
- ☆484Updated 10 years ago
- Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance.☆407Updated last year
- ☆70Updated last year
- CUDA by Example, written by two senior members of the CUDA software platform team, shows programmers how to employ this new technology. …☆475Updated 2 years ago
- Personal Notes for Learning HPC & Parallel Computation [NO LONGER ADDING NEW CONTENT]☆77Updated 3 years ago
- ☆120Updated last year
- IMPACT GPU Algorithms Teaching Labs☆59Updated 2 years ago
- Example code for Intel AVX / AVX2 intrinsics.☆144Updated 2 years ago
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core)☆145Updated 5 years ago
- A sparse BLAS lib supporting multiple backends☆49Updated 2 months ago
- An implementation of parallel exclusive scan in CUDA☆65Updated 7 years ago
- 14 basic topics for VEGA64 performance optimization☆63Updated 4 years ago
- This is an implementation of sgemm_kernel on L1d cache.☆233Updated last year
- ☆43Updated 4 years ago
- CUDA Matrix Multiplication Optimization☆256Updated last year
- row-major matmul optimization☆701Updated 5 months ago
- Main Book repository for the Parallel and High Performance Computing book, Manning Publications☆226Updated 3 years ago
- ☆69Updated 2 years ago