Qwesh157 / conv_op_optimization

This project is about optimizing convolution operators on GPUs, including GEMM-based (implicit GEMM) convolution.
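Implicit GEMM treats a convolution as a matrix multiply C = A·B with GEMM_M = N·P·Q (one row per output pixel), GEMM_N = K (one column per filter) and GEMM_K = C·R·S (the reduction over input channels and filter taps). It computes each A element's input address on the fly instead of materializing an im2col buffer. The sketch below is illustrative only and is not taken from this project. It assumes NCHW input, KCRS filters, stride 1, no padding and no dilation. The function names are hypothetical, and it is written in pure Python so the index mapping stays easy to read.

```python
# Hypothetical illustration of implicit-GEMM convolution indexing (not this project's code).
# Assumptions: NCHW input, KCRS filters, stride 1, no padding, no dilation.
# Pure Python for readability; a GPU kernel would tile these loops instead.

def zeros4(a, b, c, d):
    """Allocate a nested a x b x c x d list filled with 0.0."""
    return [[[[0.0] * d for _ in range(c)] for _ in range(b)] for _ in range(a)]


def conv2d_direct(x, w):
    """Reference: straightforward direct convolution with 7 nested loops."""
    N, C, H, W = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    K, R, S = len(w), len(w[0][0]), len(w[0][0][0])
    P, Q = H - R + 1, W - S + 1
    y = zeros4(N, K, P, Q)
    for n in range(N):
        for k in range(K):
            for p in range(P):
                for q in range(Q):
                    acc = 0.0
                    for c in range(C):
                        for r in range(R):
                            for s in range(S):
                                acc += x[n][c][p + r][q + s] * w[k][c][r][s]
                    y[n][k][p][q] = acc
    return y


def conv2d_implicit_gemm(x, w):
    """Same result, expressed as a GEMM whose A operand is never materialized."""
    N, C, H, W = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    K, R, S = len(w), len(w[0][0]), len(w[0][0][0])
    P, Q = H - R + 1, W - S + 1
    gemm_m, gemm_n, gemm_k = N * P * Q, K, C * R * S
    y = zeros4(N, K, P, Q)
    for m in range(gemm_m):            # row of A and C: one output pixel (n, p, q)
        n, pq = divmod(m, P * Q)
        p, q = divmod(pq, Q)
        for col in range(gemm_n):      # column of B and C: one filter k
            acc = 0.0
            for kk in range(gemm_k):   # reduction index decomposed into (c, r, s)
                c, rs = divmod(kk, R * S)
                r, s = divmod(rs, S)
                a = x[n][c][p + r][q + s]  # A[m][kk], gathered on the fly (no im2col buffer)
                b = w[col][c][r][s]        # B[kk][col], a reshaped view of the filters
                acc += a * b
            y[n][col][p][q] = acc
    return y
```

In a real GPU kernel, the `m` and `col` loops would typically be mapped to thread-block and warp tiles, and the `kk` loop would be split into shared-memory stages. The `divmod` decomposition is what lets each thread compute its input address without an explicit im2col copy.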

☆37

Alternatives and similar repositories for conv_op_optimization

Users who are interested in conv_op_optimization are comparing it to the libraries listed below.


nicolaswilde / cuda-tensorcore-hgemm
☆149 · Updated 7 months ago
DD-DuDa / Cute-Learning
Examples of CUDA implementations using CUTLASS CuTe
☆214 · Updated last month
reed-lau / cute-gemm
☆128 · Updated 7 months ago
CalebDu / Awesome-Cute
☆89 · Updated 2 months ago
zeroine / cutlass-cute-sample
☆37 · Updated last year
nicolaswilde / cuda-sgemm
☆67 · Updated 6 months ago
gty111 / GEMM_MMA
Optimize GEMM with Tensor Cores step by step
☆31 · Updated last year
AyakaGEMM / Hands-on-GEMM
☆137 · Updated last year
njuhope / cuda_sgemm
☆113 · Updated last year
Bruce-Lee-LY / cuda_hgemv
Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores.
☆63 · Updated 10 months ago
CalvinXKY / BasicCUDA
A tutorial for CUDA & PyTorch
☆150 · Updated 6 months ago
KnowingNothing / MatmulTutorial
An easy-to-understand TensorOp matmul tutorial
☆369 · Updated 10 months ago
ColfaxResearch / cfx-article-src
☆127 · Updated 2 months ago
OpenPPL / ppl.llm.kernel.cuda
☆149 · Updated 6 months ago
MARD1NO / CUDA-PPT
☆102 · Updated 4 months ago
XiaoSong9905 / HPC-Notes
Personal notes for learning HPC & parallel computation [actively adding new content]
☆69 · Updated 3 years ago
Bruce-Lee-LY / cuda_hgemm
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instruct…
☆445 · Updated 10 months ago
luliyucoordinate / cute-flash-attention
Implements Flash Attention using CuTe.
☆92 · Updated 7 months ago
FdyCN / PTX-ISA
Chinese translation of the CUDA PTX ISA documentation
☆45 · Updated 2 months ago
yzhaiustc / Optimizing-SGEMM-on-NVIDIA-Turing-GPUs
Optimizing SGEMM kernel functions on NVIDIA GPUs to close-to-cuBLAS performance.
☆369 · Updated 7 months ago
Cjkkkk / CUDA_gemm
A simple high-performance CUDA GEMM implementation.
☆392 · Updated last year
Yinghan-Li / YHs_Sample
Yinghan's Code Sample
☆340 · Updated 3 years ago
TiledTensor / TiledCUDA
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …
☆183 · Updated 6 months ago
rchardx / cuda-gemm
☆25 · Updated 4 months ago
weishengying / cutlass_flash_atten_fp8
FP8 flash attention implemented on the Ada architecture using the CUTLASS repository
☆74 · Updated 11 months ago
iclementine / optimize_softmax
Optimize softmax in Triton for many cases
☆21 · Updated 10 months ago
leimao / CUDA-GEMM-Optimization
CUDA Matrix Multiplication Optimization
☆213 · Updated last year
xlite-dev / HGEMM
⚡️Write HGEMM from scratch using Tensor Cores with the WMMA, MMA and CuTe APIs and achieve peak⚡️ performance.
☆93 · Updated 2 months ago
leimao / CUTLASS-Examples
CUTLASS and CuTe Examples
☆65 · Updated 2 weeks ago
RussWong / LLM-engineering
☆24 · Updated 4 months ago