talhasaruhan / cpp-matmul
Fast, multithreaded, AVX/FMA matrix multiplication kernel in C++17
☆18Updated 6 years ago
Alternatives and similar repositories for cpp-matmul:
Users who are interested in cpp-matmul are comparing it to the libraries listed below.
- 🎃 GPU load-balancing library for regular and irregular computations.☆58Updated 7 months ago
- Generate simple index ranges in C++ and CUDA C++☆39Updated last year
- Implementation and analysis of five different GPU based SPMV algorithms in CUDA☆37Updated 5 years ago
- MagmaDNN: a simple deep learning framework in C++☆48Updated 4 years ago
- NVIDIA HPCG is based on the HPCG benchmark and optimized for performance on NVIDIA accelerated HPC systems.☆45Updated 3 months ago
- DLA-Future☆69Updated this week
- Reference implementation of the draft C++ GraphBLAS specification.☆29Updated 11 months ago
- Efficient SpGEMM on GPU using CUDA and CSR☆50Updated last year
- CUDA and OpenMP implementations of C2R/R2C in-place transposition☆46Updated 9 years ago
- Sympiler is a Code Generator for Transforming Sparse Matrix Codes☆42Updated last year
- ☆23Updated 2 years ago
- A library to benchmark CUDA code, similar to google benchmark.☆28Updated 3 years ago
- A portable implementation of SZ lossy compression for AMD GPUs and Hygon DCUs.☆7Updated last month
- Specialized Parallel Linear Algebra, providing distributed GEMM functionality for specific matrix distributions with optional GPU acceler…☆27Updated 6 months ago
- High-performance, GPU-aware communication library☆84Updated last week
- Resources for the introductory GPU programming course of the HPC-AI specialized master's program☆22Updated last year
- Header-only C++20 wrapper for MPI 4.0.☆44Updated last year
- Distributed View Extension for Kokkos☆43Updated last month
- Compiler-agnostic metaprogramming library providing concepts, type operations and tuples for C++ and CUDA☆83Updated this week
- CUDA implementation of the fundamental sum reduce operation. Aims to be as optimized as reasonable.☆36Updated 7 years ago
- A simple profiler to count NVIDIA PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis.☆47Updated last year
- ☆26Updated 5 years ago
- Sparse matrix pre-processing library☆81Updated 8 months ago
- C++ Template Linear Algebra PACKage☆43Updated last week
- Uses Tensor Cores to compute back-to-back HGEMM (half-precision general matrix multiplication) with the MMA PTX instruction.☆11Updated last year
- Distributed Communication-Optimal LU-factorization Algorithm☆12Updated 3 years ago
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core)☆122Updated 4 years ago
- Subset of BLAS routines optimized for NVIDIA GPUs☆67Updated last year
- High-performance Geometric Multigrid☆33Updated 5 years ago
- Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs.☆30Updated last month