talhasaruhan / cpp-matmul
Fast, multithreaded, AVX/FMA matrix multiplication kernel in C++17
☆18Updated 6 years ago
Alternatives and similar repositories for cpp-matmul
Users who are interested in cpp-matmul are comparing it to the libraries listed below.
- Generate simple index ranges in C++ and CUDA C++☆39Updated 2 years ago
- 🎃 GPU load-balancing library for regular and irregular computations.☆62Updated last year
- Reference implementation of the draft C++ GraphBLAS specification.☆33Updated 4 months ago
- Numbast is a tool to build an automated pipeline that converts CUDA APIs into Numba bindings.☆47Updated this week
- A Library for fast Hash Tables on GPUs☆124Updated 2 years ago
- MagmaDNN: a simple deep learning framework in c++☆49Updated 4 years ago
- Directed Acyclic Graph Execution Engine (DAGEE) is a C++ library that enables programmers to express computation and data movement, as ta…☆46Updated 3 years ago
- [DEPRECATED] Moved to ROCm/rocm-libraries repo☆84Updated this week
- Resources for the introductory GPU programming course of the Mastère Spécialisé HPC-AI☆21Updated last year
- ☆23Updated 3 years ago
- Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs.☆33Updated 2 months ago
- Distributed Communication-Optimal Matrix-Matrix Multiplication Algorithm☆206Updated last month
- Unit benchmarks of CUDA event APIs.☆17Updated last year
- A simple profiler to count Nvidia PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis.☆55Updated 3 months ago
- [DEPRECATED] Moved to ROCm/rocm-libraries repo☆119Updated this week
- Intel Data Parallel C++ (and SYCL 2020) Tutorial.☆93Updated 3 years ago
- Distributed View Extension for Kokkos☆46Updated 6 months ago
- Compiler agnostic metaprogramming library providing concepts, type operations and tuples for C++ and cuda☆87Updated this week
- CUDA and OpenMP implementations of C2R/R2C inplace transposition☆46Updated 10 years ago
- FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme☆74Updated 3 months ago
- AMD’s C++ library for accelerating tensor primitives☆42Updated this week
- Subset of BLAS routines optimized for NVIDIA GPUs☆69Updated 2 years ago
- An implementation of parallel exclusive scan in CUDA☆62Updated 7 years ago
- ☆29Updated 5 years ago
- ☆44Updated 4 years ago
- Sympiler is a Code Generator for Transforming Sparse Matrix Codes☆43Updated last year
- DLA-Future☆75Updated last month
- CUDA kernel author's tools☆111Updated 3 years ago
- Specialized Parallel Linear Algebra, providing distributed GEMM functionality for specific matrix distributions with optional GPU acceler…☆29Updated last year
- fast Fourier transform on GPU in shared memory for AstroAccelerate project☆26Updated 4 years ago