yzhaiustc / Optimizing-DGEMM-on-Intel-CPUs-with-AVX512FLinks

Stepwise optimizations of DGEMM on CPU, reaching performance faster than Intel MKL eventually, even under multithreading.

☆155

Alternatives and similar repositories for Optimizing-DGEMM-on-Intel-CPUs-with-AVX512F

Users that are interested in Optimizing-DGEMM-on-Intel-CPUs-with-AVX512F are comparing it to the libraries listed below

Sorting:

RRZE-HPC / gpu-benches
collection of benchmarks to measure basic GPU capabilities
☆461Updated last month
yzhaiustc / Optimizing-SGEMM-on-NVIDIA-Turing-GPUs
Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance.
☆394Updated 11 months ago
InfiniTensor / InfiniTensor
☆274Updated last month
sjfeng1999 / gpu-arch-microbenchmark
Dissecting NVIDIA GPU Architecture
☆112Updated 3 years ago
daadaada / turingas
Assembler for NVIDIA Volta and Turing GPUs
☆234Updated 3 years ago
Cjkkkk / CUDA_gemm
A simple high performance CUDA GEMM implementation.
☆419Updated last year
XiaoSong9905 / HPC-Notes
Personal Notes for Learning HPC & Parallel Computation [Active Adding New Content]
☆75Updated 3 years ago
TiledTensor / TiledCUDA
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …
☆190Updated 10 months ago
nicolaswilde / cuda-tensorcore-hgemm
☆156Updated 11 months ago
cloudcores / CuAssembler
An unofficial cuda assembler, for all generations of SASS, hopefully ：）
☆557Updated 2 years ago
wmmae / wmma_extension
An extension library of WMMA API (Tensor Core API)
☆109Updated last year
sunlex0717 / DissectingTensorCores
☆109Updated last year
Yinghan-Li / YHs_Sample
Yinghan's Code Sample
☆358Updated 3 years ago
leimao / CUDA-GEMM-Optimization
CUDA Matrix Multiplication Optimization
☆241Updated last year
KnowingNothing / MatmulTutorial
A Easy-to-understand TensorOp Matmul Tutorial
☆394Updated last month
XiaoSong9905 / CUDA-Optimization-Guide
Xiao's CUDA Optimization Guide [NO LONGER ADDING NEW CONTENT]
☆318Updated 3 years ago
codyjrivera / tsm2x-imp
Implementation of TSM2L and TSM2R -- High-Performance Tall-and-Skinny Matrix-Matrix Multiplication Algorithms for CUDA
☆35Updated 5 years ago
microsoft / ConvStencil
☆34Updated last year
buddy-compiler / buddy-benchmark
Benchmark Framework for Buddy Projects
☆55Updated last month
reed-lau / cute-gemm
☆151Updated 3 weeks ago
openmlir / mlir-tutorial
Hands-On Practical MLIR Tutorial
☆45Updated 3 months ago
njuhope / cuda_sgemm
☆116Updated last year
Bruce-Lee-LY / cuda_hgemm
Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruct…
☆503Updated last year
wangzyon / NVIDIA_SGEMM_PRACTICE
Step-by-step optimization of CUDA SGEMM
☆411Updated 3 years ago
fsword73 / HIP-Performance-Optmization-on-VEGA64
14 basic topics for VEGA64 performance optmization
☆63Updated 4 years ago
wzsh / wmma_tensorcore_sample
Matrix Multiply-Accumulate with CUDA and WMMA( Tensor Core)
☆146Updated 5 years ago
Cambricon / triton-linalg
Development repository for the Triton-Linalg conversion
☆206Updated 10 months ago
Lewuathe / mlir-hello
MLIR Sample dialect
☆131Updated 9 months ago
zjin-lcf / HeCBench
☆268Updated 3 weeks ago
DD-DuDa / Cute-Learning
Examples of CUDA implementations by Cutlass CuTe
☆254Updated 5 months ago