DefTruth / CUDA-Learn-Notes

📚200+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).

☆51

Alternatives and similar repositories for CUDA-Learn-Notes

Users who are interested in CUDA-Learn-Notes compare it to the repositories listed below.

DD-DuDa / Cute-Learning
Examples of CUDA implementations by Cutlass CuTe
☆254 · Updated 5 months ago
Bruce-Lee-LY / cuda_hgemm
Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruct…
☆503 · Updated last year
KnowingNothing / MatmulTutorial
An Easy-to-understand TensorOp Matmul Tutorial
☆394 · Updated last month
ifromeast / cuda_learning
learning how CUDA works
☆344 · Updated 9 months ago
SiriusNEO / Triton-Puzzles-Lite
Puzzles for learning Triton, play it with minimal environment configuration!
☆569 · Updated this week
yzhaiustc / Optimizing-SGEMM-on-NVIDIA-Turing-GPUs
Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance.
☆394 · Updated 11 months ago
66RING / tiny-flash-attention
flash attention tutorial written in python, triton, cuda, cutlass
☆454 · Updated 6 months ago
Cjkkkk / CUDA_gemm
A simple high performance CUDA GEMM implementation.
☆418 · Updated last year
flagos-ai / FlagGems
FlagGems is an operator library for large language models implemented in the Triton Language.
☆773 · Updated this week
Yinghan-Li / YHs_Sample
Yinghan's Code Sample
☆358 · Updated 3 years ago
reed-lau / cute-gemm
☆149 · Updated 3 weeks ago
nicolaswilde / cuda-sgemm
☆70 · Updated 10 months ago
RussWong / CUDATutorial
A CUDA tutorial that teaches CUDA programming from scratch
☆260 · Updated last year
BBuf / how-to-learn-deep-learning-framework
how to learn PyTorch and OneFlow
☆460 · Updated last year
XiaoSong9905 / CUDA-Optimization-Guide
Xiao's CUDA Optimization Guide [NO LONGER ADDING NEW CONTENT]
☆318 · Updated 3 years ago
ArthurinRUC / cutlass-notes
From Minimal GEMM to Everything
☆82 · Updated 3 weeks ago
wangzyon / NVIDIA_SGEMM_PRACTICE
Step-by-step optimization of CUDA SGEMM
☆405 · Updated 3 years ago
nicolaswilde / cuda-tensorcore-hgemm
☆156 · Updated 11 months ago
AyakaGEMM / Hands-on-GEMM
☆144 · Updated last year
Cambricon / triton-linalg
Development repository for the Triton-Linalg conversion
☆206 · Updated 9 months ago
ColfaxResearch / cfx-article-src
☆158 · Updated 6 months ago
InfiniTensor / InfiniTensor
☆274 · Updated last month
CalebDu / Awesome-Cute
☆112 · Updated 6 months ago
Tongkaio / CUDA_Kernel_Samples
Hand-written CUDA operators and an interview guide
☆701 · Updated 3 months ago
xlite-dev / ffpa-attn
🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA.
☆233 · Updated 2 weeks ago
CalvinXKY / BasicCUDA
A tutorial for CUDA&PyTorch
☆170 · Updated 10 months ago
OpenPPL / ppl.llm.kernel.cuda
☆152 · Updated 10 months ago
chenhongyu2048 / LLM-inference-optimization-paper
Summary of some awesome work for optimizing LLM inference
☆139 · Updated this week
RussWong / LLM-engineering
☆26 · Updated 3 months ago
gty111 / GEMM_WMMA
GEMM by WMMA (tensor core)
☆14 · Updated 3 years ago