talhasaruhan / cpp-matmul
Fast, multithreaded, AVX/FMA matrix multiplication kernel in C++17
☆18Updated 6 years ago
Alternatives and similar repositories for cpp-matmul
Users who are interested in cpp-matmul are comparing it to the libraries listed below.
- Generate simple index ranges in C++ and CUDA C++☆39Updated last year
- Reference implementation of the draft C++ GraphBLAS specification.☆33Updated 3 months ago
- 🎃 GPU load-balancing library for regular and irregular computations.☆62Updated 11 months ago
- Intel Data Parallel C++ (and SYCL 2020) Tutorial.☆93Updated 3 years ago
- MagmaDNN: a simple deep learning framework in c++☆49Updated 4 years ago
- Compiler agnostic metaprogramming library providing concepts, type operations and tuples for C++ and cuda☆87Updated 3 weeks ago
- A simple profiler to count Nvidia PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis.☆52Updated 2 months ago
- ☆91Updated 8 years ago
- Sympiler is a Code Generator for Transforming Sparse Matrix Codes☆43Updated last year
- Tensor library for c++☆14Updated 5 years ago
- The Combinatorial BLAS (CombBLAS) is an extensible distributed-memory parallel graph library offering a small but powerful set of linear …☆75Updated last week
- DLA-Future☆74Updated 2 weeks ago
- Distributed Communication-Optimal Matrix-Matrix Multiplication Algorithm☆206Updated 3 weeks ago
- Subset of BLAS routines optimized for NVIDIA GPUs☆68Updated 2 years ago
- CUDA and OpenMP implementations of C2R/R2C inplace transposition☆46Updated 10 years ago
- Distributed Communication-Optimal LU-factorization Algorithm☆12Updated 3 years ago
- Examples for using SYCL on CUDA☆62Updated 3 months ago
- ☆29Updated 2 weeks ago
- portDNN is a library implementing neural network algorithms written using SYCL☆113Updated last year
- Unit benchmarks of CUDA event APIs.☆17Updated last year
- ☆23Updated 3 years ago
- ☆29Updated 5 years ago
- SLATE is a distributed, GPU-accelerated, dense linear algebra library targeting current and upcoming high-performance computing (HPC) sy…☆120Updated last week
- GTensor is a multi-dimensional array C++14 header-only library for hybrid GPU development.☆35Updated 2 months ago
- Fast and full-featured Matrix Market I/O library for C++, Python, and R☆79Updated 10 months ago
- RAJA Performance Suite☆117Updated last week
- Distributed ranges is a generalization of C++ ranges for distributed data structures.☆51Updated 3 weeks ago
- BGHT: High-performance static GPU hash tables.☆65Updated 2 months ago
- Portable HPC Containers (C++)☆48Updated this week
- CUDA kernel author's tools☆111Updated 3 years ago