talhasaruhan / cpp-matmul
Fast, multithreaded, AVX/FMA matrix multiplication kernel in C++17
☆18 Updated 7 years ago
Alternatives and similar repositories for cpp-matmul
Users who are interested in cpp-matmul are comparing it to the libraries listed below.
- GPU load-balancing library for regular and irregular computations. ☆63 Updated 3 months ago
- Intel Data Parallel C++ (and SYCL 2020) Tutorial. ☆95 Updated 3 years ago
- Subset of BLAS routines optimized for NVIDIA GPUs ☆74 Updated 2 years ago
- Generate simple index ranges in C++ and CUDA C++ ☆39 Updated 2 years ago
- Distributed Communication-Optimal Matrix-Matrix Multiplication Algorithm ☆211 Updated last week
- Full-speed Array of Structures access ☆176 Updated 2 years ago
- Demonstration of various hardware effects on CUDA GPUs. ☆390 Updated 2 years ago
- ☆19 Updated 6 years ago
- High-performance, GPU-aware communication library ☆86 Updated 11 months ago
- Kernel Tuner ☆374 Updated this week
- CUDA kernel author's tools ☆114 Updated 3 years ago
- Modular C++ Toolkit for Performance Analysis and Logging. Profiling API and Tools for C, C++, CUDA, Fortran, and Python. The C++ template… ☆366 Updated last year
- A single-header C++ library for simplifying the use of CUDA Runtime Compilation (NVRTC). ☆566 Updated 2 months ago
- CUDA and OpenMP implementations of C2R/R2C inplace transposition ☆48 Updated 10 years ago
- C++ library for reading and writing of numpy's .npy files ☆422 Updated last year
- CUDA implementation of the fundamental sum reduce operation. Aims to be as optimized as reasonable. ☆39 Updated 8 years ago
- Kokkos C++ Performance Portability Programming Ecosystem: Math Kernels - Provides BLAS, Sparse BLAS and Graph Kernels ☆368 Updated this week
- A massively-parallel, block-sparse tensor framework written in C++ ☆309 Updated 3 weeks ago
- Performance-portable geometric search library ☆220 Updated last week
- ☆29 Updated 6 years ago
- Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH) ☆113 Updated 2 years ago
- The Charm++ parallel programming system. Visit https://charmplusplus.org/ for more information. ☆229 Updated this week
- Implementation and analysis of five different GPU based SPMV algorithms in CUDA ☆40 Updated 6 years ago
- ☆94 Updated 8 years ago
- Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial ☆335 Updated last week
- An implementation of parallel exclusive scan in CUDA ☆65 Updated 7 years ago
- Abstraction Library for Parallel Kernel Acceleration ☆396 Updated this week
- ☆598 Updated this week
- Header-only C++20 wrapper for MPI 4.0. ☆47 Updated 2 years ago
- A C++17 message passing library based on MPI ☆178 Updated last month