scalable-analyses / sme
☆25Updated 2 months ago
Alternatives and similar repositories for sme
Users that are interested in sme are comparing it to the libraries listed below
- Running linear algebra as fast as possible on Apple silicon☆20Updated last year
- A framework that supports executing unmodified CUDA source code on non-NVIDIA devices.☆127Updated 5 months ago
- CPU micro benchmarks☆57Updated 2 weeks ago
- Kernel extension that allows pinning a thread to a specific CPU core on Apple Silicon machines☆17Updated 6 months ago
- Unofficial description of the CUDA assembly (SASS) instruction sets.☆94Updated 2 months ago
- GPU Performance Advisor☆65Updated 2 years ago
- Provides a set of benchmarks that can be used to measure the memory bandwidth performance of CPUs☆89Updated last year
- A memory profiler for NVIDIA GPUs to explore memory inefficiencies in GPU-accelerated applications.☆25Updated 7 months ago
- An HPL-AI implementation for Fugaku☆21Updated 3 years ago
- Intel® Extension for MLIR. A staging ground for MLIR dialects and tools for Intel devices using the MLIR toolchain.☆135Updated last week
- A GPU FP32 computation method with Tensor Cores.☆20Updated 2 years ago
- ☆41Updated 2 weeks ago
- Optimize GEMM. With AVX512 and AVX512-BF16, 800x improvement.☆15Updated 4 years ago
- ☆44Updated 4 years ago
- BLIS fork with kernels for Apple M1. (Perhaps) The first open-source BLAS with Apple Matrix Coprocessor support.☆35Updated 2 years ago
- FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme☆70Updated 2 months ago
- ☆30Updated 2 years ago
- ☆14Updated last year
- Updated C version of the Test Suite for Vectorising Compilers☆61Updated last year
- ☆35Updated 3 years ago
- Triton to TVM transpiler.☆19Updated 7 months ago
- A simple profiler to count Nvidia PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis.☆52Updated 2 months ago
- ☆96Updated last year
- ☆27Updated last year
- Implementation of TSM2L and TSM2R -- High-Performance Tall-and-Skinny Matrix-Matrix Multiplication Algorithms for CUDA☆32Updated 4 years ago
- ☆52Updated 5 years ago
- Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs.☆32Updated 2 months ago
- The translator that supports translating NVPTX to SPIR-V. This translator is modified from LLVM-SPIR-V Translator.☆39Updated 3 years ago
- MatMul Performance Benchmarks for a Single CPU Core comparing both hand-engineered and codegen kernels.☆131Updated last year
- Bridging polyhedral analysis tools to the MLIR framework☆111Updated last year