romz-pl / matrix-matrix-multiply
Algorithms for matrix-matrix multiplication, dgemm, AVX-256, AVX-512
☆24Updated last year
Alternatives and similar repositories for matrix-matrix-multiply
Users who are interested in matrix-matrix-multiply are comparing it to the libraries listed below.
- Unofficial description of the CUDA assembly (SASS) instruction sets.☆201Updated 6 months ago
- Advanced Matrix Extensions (AMX) Guide☆109Updated 4 years ago
- An implementation of HPL-AI Mixed-Precision Benchmark based on hpl-2.3☆29Updated 4 years ago
- ☆77Updated last year
- Official page for 18-847C (Spring '22): Data Center Computing☆15Updated 3 years ago
- Samples demonstrating how to use the Compute Sanitizer Tools and Public API☆93Updated 2 years ago
- IMPACT GPU Algorithms Teaching Labs☆59Updated 2 years ago
- Stepwise optimizations of DGEMM on CPU, eventually outperforming Intel MKL, even under multithreading.☆163Updated 4 years ago
- Tutorials for NVIDIA CUPTI samples☆52Updated 3 months ago
- Short examples illustrating AVX2 intrinsics for simple tasks.☆98Updated last year
- A simple profiler to count Nvidia PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis.☆57Updated 10 months ago
- A Top-Down Profiler for GPU Applications☆22Updated last year
- An experimental CPU backend for Triton (https://github.com/openai/triton)☆49Updated 5 months ago
- GPUOcelot: A dynamic compilation framework for PTX☆219Updated last year
- Forked from https://bitbucket.org/berkeleylab/cs-roofline-toolkit/src/master/☆25Updated 6 years ago
- 🎃 GPU load-balancing library for regular and irregular computations.☆66Updated 5 months ago
- ☆19Updated 9 years ago
- High-Performance FP32 GEMM on CUDA devices☆117Updated last year
- ☆11Updated 2 years ago
- ☆67Updated last year
- Examples and exercises from the book Programming Massively Parallel Processors - A Hands-on Approach. David B. Kirk and Wen-mei W. Hwu (T…☆77Updated 5 years ago
- Provides a set of benchmarks that can be used to measure the memory bandwidth performance of CPUs☆92Updated last year
- An experimental CPU backend for Triton☆175Updated 3 months ago
- Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard!☆201Updated this week
- ☆54Updated 9 months ago
- Stanford CS149 -- Assignment 1☆145Updated 3 months ago
- Online CUDA Occupancy Calculator☆83Updated 4 years ago
- Instructions, Docker images, and examples for Nsight Compute and Nsight Systems☆136Updated 5 years ago
- GPU Performance Advisor☆65Updated 3 years ago
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.☆106Updated 7 months ago