XiaoSong9905 / CUDA-Optimization-Guide

Xiao's CUDA Optimization Guide [NO LONGER ADDING NEW CONTENT]

☆309

Alternatives and similar repositories for CUDA-Optimization-Guide

Users who are interested in CUDA-Optimization-Guide are comparing it to the repositories listed below.

Cjkkkk / CUDA_gemm
A simple high performance CUDA GEMM implementation.
☆392 · Updated last year
XiaoSong9905 / HPC-Notes
Personal Notes for Learning HPC & Parallel Computation [Active Adding New Content]
☆69 · Updated 3 years ago
yzhaiustc / Optimizing-SGEMM-on-NVIDIA-Turing-GPUs
Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance.
☆370 · Updated 7 months ago
Yinghan-Li / YHs_Sample
Yinghan's Code Sample
☆340 · Updated 3 years ago
Liu-xiandong / How_to_optimize_in_GPU
This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several…
☆1,116 · Updated 2 years ago
DD-DuDa / Cute-Learning
Examples of CUDA implementations by Cutlass CuTe
☆214 · Updated last month
KnowingNothing / MatmulTutorial
An Easy-to-understand TensorOp Matmul Tutorial
☆369 · Updated 10 months ago
ifromeast / cuda_learning
learning how CUDA works
☆295 · Updated 5 months ago
nicolaswilde / cuda-tensorcore-hgemm
☆149 · Updated 7 months ago
Bruce-Lee-LY / cuda_hgemm
Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruct…
☆450 · Updated 10 months ago
njuhope / cuda_sgemm
☆113 · Updated last year
AyakaGEMM / Hands-on-GEMM
☆137 · Updated last year
interestingLSY / CUDA-From-Correctness-To-Performance-Code
Codes & examples for "CUDA - From Correctness to Performance"
☆103 · Updated 9 months ago
BBuf / how-to-learn-deep-learning-framework
how to learn PyTorch and OneFlow
☆445 · Updated last year
RussWong / CUDATutorial
A CUDA tutorial for learning CUDA programming from scratch
☆247 · Updated last year
Tongkaio / CUDA_Kernel_Samples
Hand-written CUDA operators and interview guide
☆511 · Updated 6 months ago
wangzyon / NVIDIA_SGEMM_PRACTICE
Step-by-step optimization of CUDA SGEMM
☆362 · Updated 3 years ago
reed-lau / cute-gemm
☆128 · Updated 8 months ago
eedalong / ECE408
Code base and slides for ECE408: Applied Parallel Programming On GPU.
☆128 · Updated 4 years ago
66RING / tiny-flash-attention
flash attention tutorial written in python, triton, cuda, cutlass
☆398 · Updated 2 months ago
nicolaswilde / cuda-sgemm
☆67 · Updated 7 months ago
CalvinXKY / BasicCUDA
A tutorial for CUDA&PyTorch
☆150 · Updated 6 months ago
InfiniTensor / InfiniTensor
☆246 · Updated this week
tpoisonooo / how-to-optimize-gemm
row-major matmul optimization
☆649 · Updated last year
yzhaiustc / Optimizing-DGEMM-on-Intel-CPUs-with-AVX512F
Stepwise optimizations of DGEMM on CPU, reaching performance faster than Intel MKL eventually, even under multithreading.
☆151 · Updated 3 years ago
Eddie-Wang1120 / HPC-Learning-Notes
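Study notes on high-performance computing, including notes and code demos for related topics; continuously being improved. If you find it helpful, please give it a Star; it helps the author a lot, thanks!
☆442 · Updated 2 years ago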
MARD1NO / CUDA-PPT
☆102 · Updated 4 months ago
alexngng / CUDA-Learn-Note
🎉CUDA notes / collection of frequently asked interview questions / C++ notes; personal notes, updated occasionally: sgemm, sgemv, warp reduce, block reduce, dot product, elementwise, softmax, layernorm, rmsnorm, hist etc.
☆36 · Updated last year
Sunt-ing / stick
A PyTorch-like deep learning framework. Just for fun.
☆156 · Updated last year
DefTruth / CUDA-Learn-Notes
📚200+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).
☆36 · Updated 3 months ago