ifromeast/cuda_learning

ifromeast / cuda_learning

learning how CUDA works

☆398

Alternatives and similar repositories for cuda_learning

Users who are interested in cuda_learning are comparing it to the repositories listed below.

BBuf / how-to-optim-algorithm-in-cuda
How to optimize some algorithms in CUDA.
☆3,142 · Updated this week
66RING / tiny-flash-attention
flash attention tutorial written in python, triton, cuda, cutlass
☆527 · Jan 20, 2026 · Updated 6 months ago
xlite-dev / LeetCUDA
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
☆11,599 · Updated this week
DD-DuDa / Cute-Learning
Examples of CUDA implementations by Cutlass CuTe
☆279 · Jul 1, 2025 · Updated last year
Liu-xiandong / How_to_optimize_in_GPU
This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several…
☆1,329 · Jul 29, 2023 · Updated 2 years ago
Bruce-Lee-LY / cuda_hgemm
Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruct…
☆556 · Sep 8, 2024 · Updated last year
reed-lau / cute-gemm
☆186 · May 11, 2026 · Updated 2 months ago
AyakaGEMM / Hands-on-GEMM
☆156 · Mar 18, 2024 · Updated 2 years ago
weishengying / cutlass_flash_atten_fp8
Implements FP8 flash attention on the Ada architecture using the cutlass repository.
☆82 · Aug 12, 2024 · Updated last year
luliyucoordinate / cute-flash-attention
Implement Flash Attention using Cute.
☆108 · Dec 17, 2024 · Updated last year
gpu-mode / lectures
Material for gpu-mode lectures
☆6,334 · Jun 15, 2026 · Updated last month
tspeterkim / flash-attention-minimal
Flash Attention in ~100 lines of CUDA (forward pass only)
☆1,169 · Dec 30, 2024 · Updated last year
zjhellofss / KuiperLLama
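A good project for campus recruiting (autumn and spring hiring) and internships: build, from scratch, an LLM inference framework that supports LLama2/3 and Qwen2.5.
☆552 · Oct 28, 2025 · Updated 8 months ago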
xlite-dev / Awesome-LLM-Inference
📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉
☆5,404 · Jun 23, 2026 · Updated 3 weeks ago
ColfaxResearch / cfx-article-src
☆192 · May 7, 2025 · Updated last year
Tony-Tan / CUDA_Freshman
☆2,780 · Jan 16, 2024 · Updated 2 years ago
nicolaswilde / cuda-sgemm
☆73 · Jan 6, 2025 · Updated last year
Tongkaio / CUDA_Kernel_Samples
Hand-written CUDA operators and an interview guide
☆1,045 · Aug 23, 2025 · Updated 10 months ago
XiaoSongXS / CUDA-Optimization-Guide
Xiao's CUDA Optimization Guide [NO LONGER ADDING NEW CONTENT]
☆328 · Nov 8, 2022 · Updated 3 years ago
godweiyang / NN-CUDA-Example
Several simple examples for popular neural network toolkits calling custom CUDA operators.
☆1,538 · Apr 29, 2021 · Updated 5 years ago
pranjalssh / fast.cu
Fastest kernels written from scratch
☆584 · Sep 18, 2025 · Updated 10 months ago
Yinghan-Li / YHs_Sample
Yinghan's Code Sample
☆365 · Jul 25, 2022 · Updated 3 years ago
feifeibear / LLMRoofline
Compare different hardware platforms via the Roofline Model for LLM inference tasks.
☆123 · Mar 13, 2024 · Updated 2 years ago
flashinfer-ai / flashinfer
FlashInfer: Kernel Library for LLM Serving
☆5,994 · Updated this week
weishengying / tiny-flash-attention
A simplified flash-attention implementation using cutlass, intended for teaching.
☆59 · Aug 12, 2024 · Updated last year
latentCall145 / channels-last-groupnorm
A CUDA kernel for NHWC GroupNorm for PyTorch
☆23 · Nov 15, 2024 · Updated last year
PaddleJitLab / CUDATutorial
A self-learning tutorial for CUDA High Performance Programming.
☆1,048 · Jan 14, 2026 · Updated 6 months ago
wangzyon / NVIDIA_SGEMM_PRACTICE
Step-by-step optimization of CUDA SGEMM
☆485 · Mar 30, 2022 · Updated 4 years ago
leimao / CUTLASS-Examples
CUTLASS and CuTe Examples
☆136 · Nov 30, 2025 · Updated 7 months ago
Eddie-Wang1120 / Professional-CUDA-C-Programming-Code-and-Notes
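Code implementations for Professional CUDA C Programming, covering most of the code from chapters 2 through 8 of the book along with the author's notes. Everything was implemented by hand by the author, so errors are inevitable; please use it with caution, and corrections are very welcome. If it helps you, please give it a Star; it means a lot to the author. Thank you!
☆387 · Oct 20, 2022 · Updated 3 years ago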
tlc-pack / libflash_attn
Standalone Flash Attention v2 kernel without libtorch dependency
☆113 · Sep 10, 2024 · Updated last year
caiwanxianhust / FasterLLaMA
A llama model inference framework implemented in CUDA C++.
☆64 · Nov 8, 2024 · Updated last year
OpenPPL / ppl.llm.kernel.cuda
☆150 · Jan 9, 2025 · Updated last year
CalebDu / Awesome-Cute
☆121 · May 16, 2025 · Updated last year
xlite-dev / qwen-image-fast
⚡️Qwen-Image 4.8x🎉 speedup with Hybrid Acceleration for low VRAM GPUs
☆17 · Oct 24, 2025 · Updated 8 months ago
ColfaxResearch / cutlass-kernels
☆269 · Jul 11, 2024 · Updated 2 years ago
xlite-dev / ffpa-attn
🤖FFPA: Extends FA-2/3 via Split-D for large headdims, 1.5x~6×↑🎉 vs SDPA, up to 513~535 TFLOPS🎉 on NVIDIA H200.
☆315 · Jul 15, 2026 · Updated last week
Cjkkkk / CUDA_gemm
A simple high performance CUDA GEMM implementation.
☆437 · Jan 4, 2024 · Updated 2 years ago
Bruce-Lee-LY / flash_attention_inference
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
☆45 · Feb 27, 2025 · Updated last year