iVishalr / GEMM
Fast matrix multiplication implementation in the C programming language. The algorithm is similar to what NumPy uses to compute dot products.
☆23 · Updated 3 years ago
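For context on what a fast C GEMM involves, the sketch below shows the cache-blocked loop ordering such implementations (and the BLAS backends NumPy calls into) are typically built on. It is a minimal illustration only, not this repository's code: the function name, row-major layout, and block size are assumptions.

```c
#include <stddef.h>

/* Minimal cache-blocked GEMM sketch: C += A * B with row-major storage.
 * A is m x k, B is k x n, C is m x n. BLOCK is a tuning parameter
 * (hypothetical value); real libraries choose it to fit the cache and
 * add SIMD kernels, packing, and threading on top of this structure. */
#define BLOCK 64

static void gemm_blocked(size_t m, size_t n, size_t k,
                         const double *A, const double *B, double *C)
{
    for (size_t i0 = 0; i0 < m; i0 += BLOCK)
        for (size_t p0 = 0; p0 < k; p0 += BLOCK)
            for (size_t j0 = 0; j0 < n; j0 += BLOCK) {
                size_t i_end = i0 + BLOCK < m ? i0 + BLOCK : m;
                size_t p_end = p0 + BLOCK < k ? p0 + BLOCK : k;
                size_t j_end = j0 + BLOCK < n ? j0 + BLOCK : n;
                for (size_t i = i0; i < i_end; ++i)
                    for (size_t p = p0; p < p_end; ++p) {
                        double a = A[i * k + p];       /* reused across the j loop */
                        for (size_t j = j0; j < j_end; ++j)
                            C[i * n + j] += a * B[p * n + j];
                    }
            }
}
```

Blocking keeps a tile of B and a strip of C resident in cache while the innermost loop streams over contiguous memory, which is where most of the speedup over the naive triple loop comes from.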
Related projects
Alternatives and complementary repositories for GEMM
- Magicube is a high-performance library for quantized sparse matrix operations (SpMM and SDDMM) for deep learning on Tensor Cores. ☆81 · Updated last year
- Optimize GEMM: an 800x improvement with AVX512 and AVX512-BF16 (see the vectorized inner-kernel sketch after this list). ☆14 · Updated 4 years ago
- An extension library for the WMMA API (Tensor Core API). ☆82 · Updated 3 months ago
- Matrix multiply-accumulate with CUDA and WMMA (Tensor Cores). ☆114 · Updated 4 years ago
- CUDA Matrix Multiplication Optimization. ☆139 · Updated 3 months ago
- Sample source code for matrix multiplication implementations in CUDA. ☆35 · Updated 6 years ago
- A language and compiler for irregular tensor programs. ☆134 · Updated 6 months ago
- 🎃 GPU load-balancing library for regular and irregular computations. ☆57 · Updated 4 months ago
- Samples demonstrating how to use the Compute Sanitizer Tools and Public API. ☆65 · Updated last year
- TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstraction for processing tiles. ☆148 · Updated this week
- Experiments and prototypes associated with IREE or MLIR. ☆49 · Updated 3 months ago
- A simple profiler to count Nvidia PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis. ☆42 · Updated 10 months ago
- A lightweight, Pythonic frontend for MLIR. ☆79 · Updated last year
- The missing pieces (as far as boilerplate reduction goes) of the upstream MLIR Python bindings. ☆66 · Updated last week
- TPP experimentation on MLIR for linear algebra. ☆111 · Updated 2 weeks ago
- Development repository for the Open Earth Compiler. ☆77 · Updated 3 years ago
- FP64-equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme. ☆46 · Updated 2 months ago
- Implementation of TSM2L and TSM2R, high-performance tall-and-skinny matrix-matrix multiplication algorithms for CUDA. ☆31 · Updated 4 years ago
- GPU Performance Advisor. ☆58 · Updated 2 years ago
- Intel® Extension for MLIR. A staging ground for MLIR dialects and tools for Intel devices using the MLIR toolchain. ☆123 · Updated this week
- MLIRX is now defunct; please see PolyBlocks: https://docs.polymagelabs.com ☆38 · Updated 11 months ago
- NUMA-aware multi-CPU multi-GPU data transfer benchmarks. ☆21 · Updated last year
- An extension of TVMScript for writing simple, high-performance GPU kernels with Tensor Cores. ☆49 · Updated 3 months ago
- Data-Centric MLIR dialect. ☆38 · Updated last year
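The "Optimize GEMM" entry above advertises AVX512. As a hedged illustration of what that means in practice, the C sketch below vectorizes the innermost GEMM loop with AVX-512 FMA intrinsics. It is not that repository's code; the function name and row-major layout are assumptions, and n is assumed to be a multiple of 16.

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical AVX-512 inner kernel: accumulate one row of C, 16 floats
 * at a time, over the full K dimension. Row-major layout; n must be a
 * multiple of 16 here. Optimized GEMMs wrap this in blocking, packing,
 * and multithreading. Compile with -mavx512f. */
static void gemm_row_avx512(size_t n, size_t k,
                            const float *Arow,   /* 1 x k row of A */
                            const float *B,      /* k x n          */
                            float *Crow)         /* 1 x n row of C */
{
    for (size_t j = 0; j < n; j += 16) {
        __m512 acc = _mm512_loadu_ps(&Crow[j]);          /* running C tile   */
        for (size_t p = 0; p < k; ++p) {
            __m512 b = _mm512_loadu_ps(&B[p * n + j]);   /* 16 columns of B  */
            __m512 a = _mm512_set1_ps(Arow[p]);          /* broadcast A(i,p) */
            acc = _mm512_fmadd_ps(a, b, acc);            /* acc += a * b     */
        }
        _mm512_storeu_ps(&Crow[j], acc);
    }
}
```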