Liu-xiandong / How_to_optimize_in_GPU

This is a series of GPU optimization topics that explains in detail how to optimize CUDA kernels. It covers several basic kernel optimizations, including elementwise, reduce, sgemv, and sgemm, among others. The performance of these kernels is essentially at or near the theoretical limit.

☆1,166

Alternatives and similar repositories for How_to_optimize_in_GPU

Users who are interested in How_to_optimize_in_GPU are comparing it to the repositories listed below.

XiaoSong9905 / CUDA-Optimization-Guide
Xiao's CUDA Optimization Guide [NO LONGER ADDING NEW CONTENT]
☆316 Updated 2 years ago
Tongkaio / CUDA_Kernel_Samples
Hand-written CUDA operators and interview guide
☆651 Updated 2 months ago
BBuf / how-to-optim-algorithm-in-cuda
How to optimize some algorithms in CUDA.
☆2,552 Updated 2 weeks ago
yzhaiustc / Optimizing-SGEMM-on-NVIDIA-Turing-GPUs
Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance.
☆385 Updated 9 months ago
BBuf / how-to-learn-deep-learning-framework
how to learn PyTorch and OneFlow
☆456 Updated last year
Cjkkkk / CUDA_gemm
A simple high performance CUDA GEMM implementation.
☆409 Updated last year
ifromeast / cuda_learning
learning how CUDA works
☆325 Updated 7 months ago
RussWong / CUDATutorial
A CUDA tutorial for learning CUDA programming from scratch
☆256 Updated last year
tpoisonooo / how-to-optimize-gemm
row-major matmul optimization
☆682 Updated 2 months ago
PaddleJitLab / CUDATutorial
A self-learning tutorial for CUDA High Performance Programming.
☆751 Updated 3 months ago
Bruce-Lee-LY / cuda_hgemm
Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruct…
☆485 Updated last year
wangzyon / NVIDIA_SGEMM_PRACTICE
Step-by-step optimization of CUDA SGEMM
☆388 Updated 3 years ago
Yinghan-Li / YHs_Sample
Yinghan's Code Sample
☆353 Updated 3 years ago
brucefan1983 / CUDA-Programming
Sample codes for my CUDA programming book
☆1,901 Updated 8 months ago
alexngng / CUDA-Learn-Note
🎉CUDA notes / collection of frequently asked interview questions / C++ notes; personal notes, updated irregularly: sgemm, sgemv, warp reduce, block reduce, dot product, elementwise, softmax, layernorm, rmsnorm, hist, etc.
☆43 Updated last year
Tony-Tan / CUDA_Freshman
☆2,586 Updated last year
deeperlearning / professional-cuda-c-programming
☆470 Updated 10 years ago
KnowingNothing / MatmulTutorial
An easy-to-understand TensorOp Matmul tutorial
☆385 Updated last week
Eddie-Wang1120 / Professional-CUDA-C-Programming-Code-and-Notes
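Code implementations for the book Professional CUDA C Programming, covering most of the code from chapters 2 through 8 along with the author's notes. Everything was implemented by hand by the author, so mistakes are inevitable; please use it with caution, and corrections are very welcome. If it helps you, please give it a star; it means a lot to the author. Thank you!
☆363 Updated 3 years ago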
Eddie-Wang1120 / HPC-Learning-Notes
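Study notes on high-performance computing, including notes and code demos for the related topics; still being improved. If it helps you, please give it a star; it means a lot to the author. Thank you!
☆448 Updated 2 years ago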
66RING / tiny-flash-attention
flash attention tutorial written in python, triton, cuda, cutlass
☆428 Updated 5 months ago
DefTruth / CUDA-Learn-Notes
📚200+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).
☆47 Updated 5 months ago
MAhaitao999 / CUDA_Programming
Code for the book 《CUDA编程基础与实践》 (CUDA Programming: Fundamentals and Practice)
☆139 Updated 3 years ago
BBuf / tvm_mlir_learn
A collection of compiler learning resources.
☆2,552 Updated 7 months ago
HeKun-NVIDIA / CUDA-Programming-Guide-in-Chinese
This is a Chinese translation of the CUDA programming guide
☆1,703 Updated 11 months ago
olcf / cuda-training-series
Training materials associated with NVIDIA's CUDA Training Series (www.olcf.ornl.gov/cuda-training-series/)
☆877 Updated last year
FlagOpen / FlagGems
FlagGems is an operator library for large language models implemented in the Triton Language.
☆696 Updated this week
siboehm / SGEMM_CUDA
Fast CUDA matrix multiplication from scratch
☆908 Updated last month
zjhellofss / KuiperLLama
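A good project for campus recruiting (autumn and spring hiring rounds) and internships: build, hands-on and from scratch, a large-model inference framework that supports LLama2/3 and Qwen2.5.
☆433 Updated 3 months ago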
KEKE046 / mlir-tutorial
Hands-On Practical MLIR Tutorial
☆624 Updated 2 years ago