yzhaiustc / Optimizing-SGEMM-on-NVIDIA-Turing-GPUs

Optimizing SGEMM kernel functions on NVIDIA GPUs to close-to-cuBLAS performance.

☆394

Alternatives and similar repositories for Optimizing-SGEMM-on-NVIDIA-Turing-GPUs

Users who are interested in Optimizing-SGEMM-on-NVIDIA-Turing-GPUs are comparing it to the repositories listed below.

Cjkkkk / CUDA_gemm
A simple high performance CUDA GEMM implementation.
☆418 Updated last year
Yinghan-Li / YHs_Sample
Yinghan's Code Sample
☆358 Updated 3 years ago
Bruce-Lee-LY / cuda_hgemm
Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruct…
☆503 Updated last year
KnowingNothing / MatmulTutorial
An Easy-to-understand TensorOp Matmul Tutorial
☆394 Updated last month
nicolaswilde / cuda-tensorcore-hgemm
☆156 Updated 11 months ago
reed-lau / cute-gemm
☆151 Updated 3 weeks ago
wangzyon / NVIDIA_SGEMM_PRACTICE
Step-by-step optimization of CUDA SGEMM
☆405 Updated 3 years ago
XiaoSong9905 / CUDA-Optimization-Guide
Xiao's CUDA Optimization Guide [NO LONGER ADDING NEW CONTENT]
☆318 Updated 3 years ago
RRZE-HPC / gpu-benches
collection of benchmarks to measure basic GPU capabilities
☆461 Updated last month
DD-DuDa / Cute-Learning
Examples of CUDA implementations by Cutlass CuTe
☆254 Updated 5 months ago
njuhope / cuda_sgemm
☆116 Updated last year
Cambricon / triton-linalg
Development repository for the Triton-Linalg conversion
☆206 Updated 9 months ago
tpoisonooo / how-to-optimize-gemm
row-major matmul optimization
☆690 Updated 3 months ago
ColfaxResearch / cfx-article-src
☆158 Updated 6 months ago
nicolaswilde / cuda-sgemm
☆70 Updated 10 months ago
AyakaGEMM / Hands-on-GEMM
☆144 Updated last year
leimao / CUDA-GEMM-Optimization
CUDA Matrix Multiplication Optimization
☆241 Updated last year
MARD1NO / CUDA-PPT
☆113 Updated 8 months ago
cloudcores / CuAssembler
An unofficial cuda assembler, for all generations of SASS, hopefully ：）
☆557 Updated 2 years ago
66RING / tiny-flash-attention
flash attention tutorial written in python, triton, cuda, cutlass
☆454 Updated 6 months ago
ifromeast / cuda_learning
learning how CUDA works
☆347 Updated 9 months ago
DefTruth / CUDA-Learn-Notes
📚200+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).
☆51 Updated 7 months ago
daadaada / turingas
Assembler for NVIDIA Volta and Turing GPUs
☆234 Updated 3 years ago
Archermmt / tvm_walk_through
code reading for tvm
☆76 Updated 3 years ago
microsoft / triton-shared
Shared Middle-Layer for Triton Compilation
☆316 Updated last month
OpenPPL / ppl.llm.kernel.cuda
☆152 Updated 10 months ago
yzhaiustc / Optimizing-DGEMM-on-Intel-CPUs-with-AVX512F
Stepwise optimizations of DGEMM on CPU, eventually outperforming Intel MKL, even under multithreading.
☆155 Updated 3 years ago
Qwesh157 / conv_op_optimization
This project is about convolution operator optimization on GPU, including GEMM-based (implicit GEMM) convolution.
☆38 Updated 2 months ago
Liu-xiandong / How_to_optimize_in_GPU
This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several…
☆1,190 Updated 2 years ago
ColfaxResearch / cutlass-kernels
☆246 Updated last year