talhasaruhan / cpp-matmul
Fast, multithreaded, AVX/FMA matrix multiplication kernel in C++17
☆18 Updated 7 years ago
Alternatives and similar repositories for cpp-matmul
Users who are interested in cpp-matmul are comparing it to the libraries listed below.
- 🎃 GPU load-balancing library for regular and irregular computations. ☆66 Updated 4 months ago
- Distributed Communication-Optimal Matrix-Matrix Multiplication Algorithm ☆212 Updated this week
- Generate simple index ranges in C++ and CUDA C++ ☆39 Updated 2 years ago
- Full-speed Array of Structures access ☆176 Updated 2 years ago
- Subset of BLAS routines optimized for NVIDIA GPUs ☆76 Updated 2 years ago
- Online CUDA Occupancy Calculator ☆83 Updated 4 years ago
- ☆98 Updated 8 years ago
- Intel Data Parallel C++ (and SYCL 2020) Tutorial. ☆95 Updated 4 years ago
- A single-header C++ library for simplifying the use of CUDA Runtime Compilation (NVRTC). ☆569 Updated 4 months ago
- Kokkos C++ Performance Portability Programming Ecosystem: Profiling and Debugging Tools ☆138 Updated 3 weeks ago
- High-performance, GPU-aware communication library ☆86 Updated last month
- Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH) ☆115 Updated 2 years ago
- Kernel Tuner ☆381 Updated last week
- A massively-parallel, block-sparse tensor framework written in C++ ☆313 Updated this week
- A Method for efficiently processing SpMV using SIMD and load balancing ☆17 Updated 3 years ago
- Performance-portable geometric search library ☆219 Updated 3 weeks ago
- RAJA Performance Suite ☆130 Updated this week
- Efficient SpGEMM on GPU using CUDA and CSR ☆59 Updated 2 years ago
- Reference implementation of the draft C++ GraphBLAS specification. ☆32 Updated 11 months ago
- Kokkos C++ Performance Portability Programming Ecosystem: Math Kernels - Provides BLAS, Sparse BLAS and Graph Kernels ☆372 Updated last week
- A C++17 message passing library based on MPI ☆180 Updated 3 months ago
- CUDA kernel author's tools ☆115 Updated 3 years ago
- Demonstration of various hardware effects on CUDA GPUs. ☆391 Updated 2 years ago
- SLATE is a distributed, GPU-accelerated, dense linear algebra library targeting current and upcoming high-performance computing (HPC) sy… ☆131 Updated 3 months ago
- Abstraction Library for Parallel Kernel Acceleration ☆401 Updated last week
- CUDA implementation of the fundamental sum reduce operation. Aims to be as optimized as reasonable. ☆39 Updated 8 years ago
- GPU Code optimizer for stencil computations. Refer to our IPDPS'19 paper for more details ☆25 Updated 6 years ago
- ☆29 Updated 6 years ago
- Header-only C++20 wrapper for MPI 4.0. ☆47 Updated last week
- Distributed View Extension for Kokkos ☆49 Updated last year