mz24cn / gemm_optimization

This repository targets performance optimization of the OpenCL GEMM function. It compares the sgemm performance of several BLAS libraries (clBLAS, clBLAST, MIOpenGemm, Intel MKL (CPU), and cuBLAS (CUDA)) across different matrix sizes, vendors' hardware, and operating systems. Prebuilt x86_64 binaries for MSVC, MinGW, and Linux (CentOS) are provided, so it works out of the box.

☆17

Alternatives and similar repositories for gemm_optimization

Users who are interested in gemm_optimization are comparing it to the libraries listed below.

cjmcv / hpc
Learning and practice of high performance computing (CUDA, Vulkan, OpenCL, OpenMP, TBB, SSE/AVX, NEON, MPI, coroutines, etc.)
☆62 Updated 7 months ago
pigirons / spmv
A tuned sparse matrix-dense vector multiplication (SpMV) library
☆22 Updated 9 years ago
ysh329 / OpenCL-101
Learn OpenCL step by step.
☆135 Updated 3 years ago
cyanguwa / nersc-roofline
☆48 Updated 5 years ago
carlushuang / cpu_gemm_opt
How to design a CPU GEMM on x86 with AVX256 that can beat OpenBLAS.
☆73 Updated 6 years ago
ap-hynninen / cutt
CUDA Tensor Transpose (cuTT) library
☆53 Updated 8 years ago
md2z34 / winograd_gpu
GPU implementation of Winograd convolution
☆10 Updated 8 years ago
passlab / CUDAMicroBench
☆46 Updated 4 months ago
lixiuhong / batched_gemm
☆39 Updated 5 years ago
yester31 / Cutlass_EX
Study of CUTLASS
☆22 Updated last year
csehydrogen / Winograd-OpenCL
Winograd-based convolution implementation in OpenCL
☆28 Updated 8 years ago
PeterTh / uCLbench
Set of OpenCL microbenchmarks
☆29 Updated this week
OpenPPL / ppl.kernel.cpu
☆19 Updated last year
OpenPPL / CuAssembler
An unofficial CUDA assembler for all generations of SASS, hopefully :)
☆84 Updated 2 years ago
wmmae / wmma_extension
An extension library of WMMA API (Tensor Core API)
☆108 Updated last year
NVlabs / cub
THIS REPOSITORY HAS MOVED TO github.com/nvidia/cub, WHICH IS AUTOMATICALLY MIRRORED HERE.
☆85 Updated last year
XiuYuLi / flexible-gemm
flexible-gemm conv of deepcore
☆17 Updated 5 years ago
CSshengxy / MEC
ICML 2017 MEC: Memory-efficient Convolution for Deep Neural Network, C++ implementation (unofficial)
☆17 Updated 6 years ago
sjfeng1999 / gpu-arch-microbenchmark
Dissecting NVIDIA GPU Architecture
☆109 Updated 3 years ago
enp1s0 / ozIMMU
FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme
☆90 Updated 7 months ago
UDC-GAC / openCNN
A Winograd Minimal Filter Implementation in CUDA
☆28 Updated 4 years ago
intel / xetla
☆62 Updated 10 months ago
rox906 / tcFFT
☆41 Updated 4 years ago
weifengliu-ssslab / Benchmark_SpGEMM_using_CSR
CSR-based SpGEMM on NVIDIA and AMD GPUs
☆46 Updated 9 years ago
spcl / FBLAS
BLAS implementation for Intel FPGA
☆77 Updated 4 years ago
PAA-NCIC / PPoPP2017_artifact
Third party assembler and GEMM library for NVIDIA Kepler GPU
☆82 Updated 6 years ago
QianyanTech / NBAssembler
Assembler and Decompiler for NVIDIA (Maxwell, Pascal, Volta, Turing, Ampere) GPUs.
☆91 Updated 2 years ago
Syencil / Programming_Massively_Parallel_Processors
Code and notes for six major CUDA parallel computing patterns
☆61 Updated 5 years ago
ekondis / gpumembench
A GPU benchmark suite for assessing on-chip GPU memory bandwidth
☆108 Updated 8 years ago
Tencent / BlazerML-tvm
Tencent Distribution of TVM
☆15 Updated 2 years ago