romz-pl / matrix-matrix-multiply
Algorithms for matrix-matrix multiplication, dgemm, AVX-256, AVX-512
☆19 · Updated 3 months ago
Alternatives and similar repositories for matrix-matrix-multiply:
Users who are interested in matrix-matrix-multiply are comparing it to the libraries listed below.
- Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs. ☆32 · Updated 3 weeks ago
- A memory profiler for NVIDIA GPUs to explore memory inefficiencies in GPU-accelerated applications. ☆25 · Updated 6 months ago
- NPBench - A Benchmarking Suite for High-Performance NumPy ☆80 · Updated 3 weeks ago
- PyTorch UCC plugin ☆21 · Updated 3 years ago
- GPU Performance Advisor ☆64 · Updated 2 years ago
- Haystack is an analytical cache model that, given a program, computes the number of cache misses. ☆46 · Updated 5 years ago
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. ☆82 · Updated this week
- A Top-Down Profiler for GPU Applications ☆17 · Updated last year
- Linux Cross-Memory Attach ☆93 · Updated 7 months ago
- ☆17 · Updated 3 years ago
- A task benchmark ☆41 · Updated 8 months ago
- Unofficial description of the CUDA assembly (SASS) instruction sets. ☆88 · Updated last month
- A lightweight, Pythonic, frontend for MLIR ☆81 · Updated last year
- Provides a set of benchmarks that can be used to measure the memory bandwidth performance of CPUs ☆89 · Updated last year
- A low-overhead tool to periodically collect system-wide hardware performance counters on Intel64 systems. ☆33 · Updated 2 years ago
- ☆41 · Updated this week
- RCCL Performance Benchmark Tests ☆64 · Updated this week
- 🎃 GPU load-balancing library for regular and irregular computations. ☆62 · Updated 10 months ago
- A hierarchical collective communications library with portable optimizations ☆33 · Updated 4 months ago
- Tools to create performance and roofline plots from measured data ☆58 · Updated 10 years ago
- MLIR-based partitioning system ☆80 · Updated last week
- A tracing infrastructure for heterogeneous computing applications. ☆32 · Updated this week
- FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme ☆59 · Updated last month
- Tacker: Tensor-CUDA Core Kernel Fusion for Improving the GPU Utilization while Ensuring QoS ☆25 · Updated 2 months ago
- ☆95 · Updated last year
- rocSHMEM intra-kernel networking runtime for AMD dGPUs on the ROCm platform. ☆76 · Updated last week
- ☆11 · Updated 4 years ago
- CUDA Templates for Linear Algebra Subroutines ☆20 · Updated last week
- Data-Centric MLIR dialect ☆40 · Updated last year
- A Python script to convert the output of NVIDIA Nsight Systems (in SQLite format) to JSON in Google Chrome Trace Event Format. ☆33 · Updated 3 months ago