KAdamek / SMFFT
Fast Fourier transform on GPU in shared memory for the AstroAccelerate project
☆27 Updated 4 years ago
Alternatives and similar repositories for SMFFT
Users who are interested in SMFFT are comparing it to the libraries listed below
- Subset of BLAS routines optimized for NVIDIA GPUs ☆73 Updated 2 years ago
- BLAS implementation for Intel FPGA ☆77 Updated 4 years ago
- Kernel Tuning Toolkit ☆64 Updated 2 months ago
- THIS REPOSITORY HAS MOVED TO github.com/nvidia/cub, WHICH IS AUTOMATICALLY MIRRORED HERE. ☆84 Updated last year
- A 128 bit unsigned integer class for CUDA ☆46 Updated 8 months ago
- CUDA tool set for non-C++ languages that provides similar functionality like Thrust, with NVRTC at its core. ☆59 Updated 3 years ago
- The SparseX sparse kernel optimization library ☆41 Updated 6 years ago
- Fast Fast Hadamard Transform ☆84 Updated 3 years ago
- [DEPRECATED] Moved to ROCm/rocm-libraries repo ☆195 Updated this week
- CUDA accelerated(X) Multi-Precision library ☆92 Updated 9 years ago
- A GPU-based LZSS compression algorithm, highly tuned for NVIDIA GPGPUs and for streaming data, leveraging the respective strengths of CPU… ☆35 Updated 9 years ago
- sparse matrix pre-processing library ☆83 Updated last year
- tools to create performance and roofline plots from measured data ☆59 Updated 11 years ago
- Full-speed Array of Structures access ☆173 Updated 2 years ago
- High-performance, GPU-aware communication library ☆86 Updated 8 months ago
- A C++ allocator based on cudaMallocManaged ☆23 Updated 6 years ago
- The Combinatorial BLAS (CombBLAS) is an extensible distributed-memory parallel graph library offering a small but powerful set of linear … ☆79 Updated last month
- A domain-specific language and compiler for image processing ☆76 Updated 4 years ago
- A unified framework across multiple programming platforms ☆41 Updated 3 months ago
- Tensor Contraction Code Generator ☆39 Updated 8 years ago
- The Surprisingly ParalleL spArse Tensor Toolkit. ☆71 Updated 3 years ago
- CUDA implementation of the fundamental sum reduce operation. Aims to be as optimized as reasonable. ☆39 Updated 8 years ago
- CUDA templates for tile-sparse matrix multiplication based on CUTLASS. ☆50 Updated 7 years ago
- CUDA and OpenMP implementations of C2R/R2C inplace transposition ☆48 Updated 10 years ago
- Intel Data Parallel C++ (and SYCL 2020) Tutorial. ☆95 Updated 3 years ago
- Archived implementation of BLAS using the SYCL open standard. See oneMath for a replacement. ☆262 Updated 8 months ago
- ☆94 Updated 8 years ago
- C++ Header-Only Library for High-Performance Tensor-Vector Multiplication ☆22 Updated 9 months ago
- C++ HPC Tutorial materials ☆55 Updated last year
- RAJA Performance Suite ☆123 Updated this week