romz-pl / matrix-matrix-multiply
Algorithms for matrix-matrix multiplication, dgemm, AVX-256, AVX-512
☆14 Updated 3 years ago
Related projects
Alternatives and complementary repositories for matrix-matrix-multiply
- NPBench - A Benchmarking Suite for High-Performance NumPy ☆73 Updated this week
- Online CUDA Occupancy Calculator ☆66 Updated 3 years ago
- An HPL-AI implementation for Fugaku ☆19 Updated 3 years ago
- ☆31 Updated 4 years ago
- LLM training in simple, raw C/CUDA ☆86 Updated 6 months ago
- Serial and parallel implementations of matrix multiplication ☆35 Updated 3 years ago
- A Library for fast Hash Tables on GPUs ☆109 Updated 2 years ago
- cuASR: CUDA Algebra for Semirings ☆34 Updated 2 years ago
- Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs. ☆29 Updated 2 months ago
- ☆14 Updated last month
- MPI+OpenMP implementation of Louvain method for Graph Community Detection, with a number of parallel heuristics/approximate computing tec… ☆27 Updated last year
- A task benchmark ☆40 Updated 3 months ago
- Fast and vectorizable algorithms for searching in a vector of sorted floating point numbers ☆120 Updated last year
- A tester for BLAS libraries including OpenBLAS and Intel MKL. This project is based on ATLAS BLAS Tester ☆33 Updated last year
- HTML/JS port of CUDA Occupancy Calculator ☆16 Updated 2 years ago
- Code for paper "Engineering a High-Performance GPU B-Tree" accepted to PPoPP 2019 ☆52 Updated 2 years ago
- MLPerf™ logging library ☆30 Updated this week
- Use tensor core to calculate back-to-back HGEMM (half-precision general matrix multiplication) with MMA PTX instruction. ☆11 Updated last year
- Fast Fast Hadamard Transform ☆77 Updated 2 years ago
- An implementation of ARMCI using MPI one-sided communication (RMA) ☆13 Updated last month
- FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme ☆46 Updated 2 months ago
- ☆16 Updated 7 months ago
- A 128 bit unsigned integer class for CUDA ☆43 Updated 3 years ago
- Fast, multithreaded, AVX/FMA matrix multiplication kernel in C++ 17 ☆17 Updated 5 years ago
- 🎃 GPU load-balancing library for regular and irregular computations. ☆57 Updated 5 months ago
- Benchmarking some transformer deployments ☆26 Updated last year
- Samples demonstrating how to use the Compute Sanitizer Tools and Public API ☆68 Updated last year
- High-performance, GPU-aware communication library ☆84 Updated last month
- NVIDIA HPCG is based on the HPCG benchmark and optimized for performance on NVIDIA accelerated HPC systems. ☆44 Updated last month
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core) ☆116 Updated 4 years ago