XiaoSongXS/CUDA-Optimization-Guide

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/XiaoSongXS/CUDA-Optimization-Guide)

XiaoSongXS / CUDA-Optimization-Guide

Xiao's CUDA Optimization Guide [NO LONGER ADDING NEW CONTENT]

☆328

Alternatives and similar repositories for CUDA-Optimization-Guide

Users that are interested in CUDA-Optimization-Guide are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

XiaoSongXS / HPC-Notes
View on GitHub
Personal Notes for Learning HPC & Parallel Computation [NO LONGER ADDING NEW CONTENT]
☆78Jul 29, 2022Updated 3 years ago
Liu-xiandong / How_to_optimize_in_GPU
View on GitHub
This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several…
☆1,329Jul 29, 2023Updated 2 years ago
XiaoSongXS / dgemm-knl
View on GitHub
DGEMM on KNL, achieve 75% MKL
☆18May 19, 2022Updated 4 years ago
LeiWang1999 / AutoGPTQ.tvm
View on GitHub
GPTQ inference TVM kernel
☆41Apr 25, 2024Updated 2 years ago
BBuf / how-to-optim-algorithm-in-cuda
View on GitHub
how to optimize some algorithm in cuda.
☆3,142Updated this week
End-to-end encrypted cloud storage - Proton Drive • Ad
Special offer: 40% Off Yearly / 80% Off First Month. Protect your most important files, photos, and documents from prying eyes.
Eddie-Wang1120 / HPC-Learning-Notes
View on GitHub
高性能计算相关知识学习笔记，包含学习笔记和相关知识的代码demo，在持续完善中。如果有帮助的话请Star一下，对作者帮助很大，谢谢！
☆476Mar 28, 2023Updated 3 years ago
KnowingNothing / MatmulTutorial
View on GitHub
A Easy-to-understand TensorOp Matmul Tutorial
☆445Mar 5, 2026Updated 4 months ago
Cjkkkk / CUDA_gemm
View on GitHub
A simple high performance CUDA GEMM implementation.
☆437Jan 4, 2024Updated 2 years ago
yzhaiustc / Optimizing-SGEMM-on-NVIDIA-Turing-GPUs
View on GitHub
Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance.
☆420Jan 2, 2025Updated last year
FZJ-JSC / tutorial-multi-gpu
View on GitHub
Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial
☆380Jun 26, 2026Updated 3 weeks ago
Yinghan-Li / YHs_Sample
View on GitHub
Yinghan's Code Sample
☆365Jul 25, 2022Updated 3 years ago
ademeure / DeeperGEMM
View on GitHub
DeeperGEMM: crazy optimized version
☆86May 5, 2025Updated last year
yester31 / Cutlass_EX
View on GitHub
study of cutlass
☆22Nov 10, 2024Updated last year
DD-DuDa / Cute-Learning
View on GitHub
Examples of CUDA implementations by Cutlass CuTe
☆279Jul 1, 2025Updated last year
Managed Database hosting by DigitalOcean • Ad
PostgreSQL, MySQL, MongoDB, Kafka, Valkey, and OpenSearch available. Automatically scale up storage and focus on building your apps.
AyakaGEMM / Hands-on-GEMM
View on GitHub
☆156Mar 18, 2024Updated 2 years ago
TiledTensor / TiledCUDA
View on GitHub
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …
☆192Jan 28, 2025Updated last year
nicolaswilde / cuda-tensorcore-hgemm
View on GitHub
☆160Dec 26, 2024Updated last year
l0ngc / hpc-learning
View on GitHub
hpc-learning
☆783May 30, 2024Updated 2 years ago
wangzyon / NVIDIA_SGEMM_PRACTICE
View on GitHub
Step-by-step optimization of CUDA SGEMM
☆485Mar 30, 2022Updated 4 years ago
xlite-dev / LeetCUDA
View on GitHub
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
☆11,578Updated this week
tpoisonooo / how-to-optimize-gemm
View on GitHub
row-major matmul optimization
☆743May 14, 2026Updated 2 months ago
Sunt-ing / stick
View on GitHub
A PyTorch-like deep learning framework. Just for fun.
☆159Oct 9, 2023Updated 2 years ago
gpu-mode / lectures
View on GitHub
Material for gpu-mode lectures
☆6,330Jun 15, 2026Updated last month
Deploy to Railway using AI coding agents - Free Credits Offer • Ad
Use Claude Code, Codex, OpenCode, and more. Autonomous software development now has the infrastructure to match with Railway.
66RING / tiny-flash-attention
View on GitHub
flash attention tutorial written in python, triton, cuda, cutlass
☆527Jan 20, 2026Updated 6 months ago
luliyucoordinate / cute-flash-attention
View on GitHub
Implement Flash Attention using Cute.
☆108Dec 17, 2024Updated last year
ifromeast / cuda_learning
View on GitHub
learning how CUDA works
☆398Mar 3, 2025Updated last year
infinigence / FlashOverlap
View on GitHub
A lightweight design for computation-communication overlap.
☆242Jan 20, 2026Updated 6 months ago
reed-lau / cute-gemm
View on GitHub
☆186May 11, 2026Updated 2 months ago
NVIDIA / cutlass
View on GitHub
CUDA Templates and Python DSLs for High-Performance Linear Algebra
☆10,104Updated this week
Bruce-Lee-LY / cuda_hgemm
View on GitHub
Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruct…
☆556Sep 8, 2024Updated last year
bytedance / flux
View on GitHub
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
☆1,344Aug 28, 2025Updated 10 months ago
weishengying / cutlass_flash_atten_fp8
View on GitHub
使用 cutlass 仓库在 ada 架构上实现 fp8 的 flash attention
☆82Aug 12, 2024Updated last year
Wordpress hosting with auto-scaling - Free Trial Offer • Ad
Fully Managed hosting for WordPress and WooCommerce businesses that need reliable, auto-scalable performance. Cloudways SafeUpdates now available.
BBuf / tvm_mlir_learn
View on GitHub
compiler learning resources collect.
☆2,759May 20, 2026Updated 2 months ago
flame / how-to-optimize-gemm
View on GitHub
☆2,020Jul 29, 2023Updated 2 years ago
microsoft / BitBLAS
View on GitHub
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
☆769Aug 6, 2025Updated 11 months ago
ColfaxResearch / cutlass-kernels
View on GitHub
☆269Jul 11, 2024Updated 2 years ago
MARD1NO / CUDA-PPT
View on GitHub
☆136Apr 16, 2026Updated 3 months ago
tile-ai / AttentionEngine
View on GitHub
☆52May 19, 2025Updated last year
KuangjuX / NVSHMEM-Tutorial
View on GitHub
NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer
☆195Feb 11, 2026Updated 5 months ago