marianhlavac / FFT-cuda
Fast Fourier Transform implementation, computable on CUDA platform. Seminar project for MI-PRC course at FIT CTU.
☆36 Updated last year
Related projects
Alternatives and complementary repositories for FFT-cuda
- ☆10 Updated 4 years ago
- ☆40 Updated 3 years ago
- CUDA for MNIST training/inference ☆38 Updated 10 months ago
- fast Fourier transform on GPU in shared memory for AstroAccelerate project ☆26 Updated 4 years ago
- CUDA implementation of the fundamental sum reduce operation. Aims to be as optimized as reasonable. ☆35 Updated 7 years ago
- GPTPU for SC 2021 ☆48 Updated last year
- Use tensor core to calculate back-to-back HGEMM (half-precision general matrix multiplication) with MMA PTX instruction. ☆11 Updated last year
- Stepwise optimizations of DGEMM on CPU, reaching performance faster than Intel MKL eventually, even under multithreading. ☆108 Updated 2 years ago
- Learning and practice of high performance computing (CUDA, Vulkan, OpenCL, OpenMP, TBB, SSE/AVX, NEON, MPI, coroutines, etc.) ☆56 Updated last week
- GEMM and Winograd based convolutions using CUTLASS ☆25 Updated 4 years ago
- ☆38 Updated 4 years ago
- Assembler for NVIDIA Volta and Turing GPUs ☆200 Updated 2 years ago
- Implementation of TSM2L and TSM2R -- High-Performance Tall-and-Skinny Matrix-Matrix Multiplication Algorithms for CUDA ☆31 Updated 4 years ago
- Dissecting NVIDIA GPU Architecture ☆82 Updated 2 years ago
- ☆32 Updated 3 years ago
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core) ☆114 Updated 4 years ago
- Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial ☆181 Updated last week
- Case studies constitute a modern interdisciplinary and valuable teaching practice which plays a critical and fundamental role in the deve… ☆10 Updated 6 years ago
- cuDNN sample codes provided by Nvidia ☆43 Updated 5 years ago
- A Winograd Minimal Filter Implementation in CUDA ☆23 Updated 3 years ago
- Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance. ☆275 Updated 2 years ago
- Bridging polyhedral analysis tools to the MLIR framework ☆102 Updated last year
- Multiple-precision GPU accelerated linear algebra routines (dense and sparse) based on residue number system ☆17 Updated last year
- My notes on various HPC papers. ☆21 Updated last year
- ☆128 Updated this week
- Some source code about matrix multiplication implementation on CUDA ☆35 Updated 6 years ago
- how to design cpu gemm on x86 with avx256, that can beat openblas. ☆65 Updated 5 years ago
- ☆79 Updated 6 months ago
- BLAS implementation for Intel FPGA ☆76 Updated 3 years ago
- Example code for Intel AVX / AVX2 intrinsics. ☆125 Updated last year