Tongkaio/CUDA_Kernel_Samples

Tongkaio / CUDA_Kernel_Samples

A guide to hand-writing CUDA operators and preparing for interviews

☆840

Alternatives and similar repositories for CUDA_Kernel_Samples

Users who are interested in CUDA_Kernel_Samples are comparing it to the repositories listed below.

xlite-dev / LeetCUDA
View on GitHub
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
☆9,755 · Updated this week
BBuf / how-to-optim-algorithm-in-cuda
View on GitHub
How to optimize some algorithms in CUDA.
☆2,825 · Feb 15, 2026 · Updated 2 weeks ago
Liu-xiandong / How_to_optimize_in_GPU
View on GitHub
This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several…
☆1,244 · Jul 29, 2023 · Updated 2 years ago
zjhellofss / KuiperLLama
View on GitHub
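A good project for campus recruiting (autumn and spring hiring rounds) and internships: build, from scratch, a large-model inference framework that supports LLama2/3 and Qwen2.5.
☆498 · Oct 28, 2025 · Updated 4 months ago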
PaddleJitLab / CUDATutorial
View on GitHub
A self-learning tutorial for CUDA High Performance Programming.
☆900 · Jan 14, 2026 · Updated last month
ifromeast / cuda_learning
View on GitHub
learning how CUDA works
☆376 · Mar 3, 2025 · Updated 11 months ago
gpu-mode / lectures
View on GitHub
Material for gpu-mode lectures
☆5,773 · Feb 1, 2026 · Updated last month
zjhellofss / KuiperInfer
View on GitHub
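A good project for campus recruiting (autumn and spring hiring rounds) and internships! Build a high-performance deep learning inference library from scratch, supporting inference for models such as llama2, Unet, Yolov5, and Resnet. Implement a high-performance deep learning inference library st…
☆3,324 · Jun 22, 2025 · Updated 8 months ago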
caibucai22 / awesome-cuda
View on GitHub
Awesome code, projects, books, etc. related to CUDA
☆31 · Feb 3, 2026 · Updated 3 weeks ago
caiwanxianhust / FasterLLaMA
View on GitHub
A llama model inference framework implemented in CUDA C++
☆64 · Nov 8, 2024 · Updated last year
TiledTensor / TiledCUDA
View on GitHub
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …
☆193 · Jan 28, 2025 · Updated last year
66RING / tiny-flash-attention
View on GitHub
flash attention tutorial written in python, triton, cuda, cutlass
☆488 · Jan 20, 2026 · Updated last month
weishengying / cutlass_flash_atten_fp8
View on GitHub
fp8 flash attention implemented on the ada architecture using the cutlass repository
☆79 · Aug 12, 2024 · Updated last year
Bruce-Lee-LY / cuda_hgemm
View on GitHub
Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruct…
☆526 · Sep 8, 2024 · Updated last year
DD-DuDa / Cute-Learning
View on GitHub
Examples of CUDA implementations by Cutlass CuTe
☆269 · Jul 1, 2025 · Updated 8 months ago
BBuf / tvm_mlir_learn
View on GitHub
A collection of compiler learning resources.
☆2,684 · Mar 19, 2025 · Updated 11 months ago
Chtholly-Boss / swizzle
View on GitHub
A practical way of learning Swizzle
☆37 · Feb 3, 2025 · Updated last year
zhaochenyang20 / Awesome-ML-SYS-Tutorial
View on GitHub
My learning notes for ML SYS.
☆5,444 · Jan 30, 2026 · Updated last month
CalebDu / Awesome-Cute
View on GitHub
☆115 · May 16, 2025 · Updated 9 months ago
Tony-Tan / CUDA_Freshman
View on GitHub
☆2,698 · Jan 16, 2024 · Updated 2 years ago
KnowingNothing / MatmulTutorial
View on GitHub
An Easy-to-understand TensorOp Matmul Tutorial
☆410 · Feb 11, 2026 · Updated 2 weeks ago
luliyucoordinate / flash-attention-minimal
View on GitHub
Flash Attention in ~100 lines of CUDA (forward pass only)
☆11 · Jun 10, 2024 · Updated last year
xlite-dev / HGEMM
View on GitHub
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.
☆147 · May 10, 2025 · Updated 9 months ago
HeKun-NVIDIA / CUDA-Programming-Guide-in-Chinese
View on GitHub
This is a Chinese translation of the CUDA programming guide
☆1,877 · Nov 13, 2024 · Updated last year
BBuf / how-to-learn-deep-learning-framework
View on GitHub
how to learn PyTorch and OneFlow
☆485 · Mar 22, 2024 · Updated last year
MLSys-Learner-Resources / Awesome-MLSys-Blogger
View on GitHub
The repository has collected a batch of noteworthy MLSys bloggers (Algorithms/Systems)
☆324 · Jan 5, 2025 · Updated last year
xlite-dev / Awesome-LLM-Inference
View on GitHub
📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉
☆5,022 · Updated this week
xlite-dev / ffpa-attn
View on GitHub
🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA.
☆251 · Feb 13, 2026 · Updated 2 weeks ago
Infrasys-AI / AIInfra
View on GitHub
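AIInfra (AI infrastructure) refers to the AI system stack, from low-level hardware such as chips up to the upper-level software stack, that supports training and inference of large AI models.
☆6,130 · Dec 22, 2025 · Updated 2 months ago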
flashinfer-ai / flashinfer
View on GitHub
FlashInfer: Kernel Library for LLM Serving
☆5,009 · Feb 23, 2026 · Updated last week
violetDelia / LLCompiler
View on GitHub
☆23 · Jun 11, 2025 · Updated 8 months ago
infinigence / FlashOverlap
View on GitHub
A lightweight design for computation-communication overlap.
☆223 · Jan 20, 2026 · Updated last month
Cjkkkk / CUDA_gemm
View on GitHub
A simple high performance CUDA GEMM implementation.
☆426 · Jan 4, 2024 · Updated 2 years ago
flashinfer-ai / cutlass-viz
View on GitHub
☆65 · Apr 26, 2025 · Updated 10 months ago
l0ngc / hpc-learning
View on GitHub
hpc-learning
☆781 · May 30, 2024 · Updated last year
AyakaGEMM / Hands-on-GEMM
View on GitHub
☆146 · Mar 18, 2024 · Updated last year
zeroine / cutlass-cute-sample
View on GitHub
☆49 · Apr 15, 2024 · Updated last year
alexshuang / write-your-own-ai-compiler
View on GitHub
"Write Your Own AI Compiler"
☆33 · Oct 19, 2024 · Updated last year
AdvancedCompiler / AdvancedCompiler
View on GitHub
Personal homepage of the Advanced Compiler Lab
☆202 · Oct 15, 2025 · Updated 4 months ago