📚200+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).
☆72Apr 26, 2025Updated 10 months ago
Alternatives and similar repositories for CUDA-Learn-Notes
Users who are interested in CUDA-Learn-Notes are comparing it to the libraries listed below.
- ☆27Apr 7, 2025Updated 11 months ago
- TinyML and Efficient Deep Learning Computing☆20Apr 26, 2024Updated last year
- 📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉☆9,932Updated this week
- ☆33Dec 10, 2025Updated 3 months ago
- A lightweight design for computation-communication overlap.☆225Jan 20, 2026Updated 2 months ago
- HIP backend patch for Numba, the NumPy aware dynamic Python compiler using LLVM.☆19Feb 16, 2026Updated last month
- GEMM by WMMA (tensor core)☆15Jul 31, 2022Updated 3 years ago
- FHE (CKKS, TFHE) end-to-end applications: HELR (logistic regression), ResNet-20, LSTM (RNN), bitonic sorting, DeepCNN-x☆18Aug 14, 2024Updated last year
- 🎉CUDA notes / collection of frequently asked interview questions / C++ notes; personal notes, updated irregularly: sgemm, sgemv, warp reduce, block reduce, dot product, elementwise, softmax, layernorm, rmsnorm, hist etc.☆40Jan 25, 2024Updated 2 years ago
- My study note for mlsys☆14Nov 4, 2024Updated last year
- d3 plugin for web interfaces☆13Jul 2, 2020Updated 5 years ago
- Toolkit for launching and observing MaxText training on Slurm-managed GPU clusters☆24Updated this week
- Research code for various data-parallel pretraining approaches, based on PyTorch GPT-2.☆11Dec 16, 2022Updated 3 years ago
- ☆61Feb 15, 2026Updated last month
- Wan: Open and Advanced Large-Scale Video Generative Models☆28Jul 28, 2025Updated 7 months ago
- analyse problems of AI with Math and Code☆27Jul 28, 2025Updated 7 months ago
- Training camp project (training track)☆26Jan 28, 2026Updated last month
- CUTLASS and CuTe Examples☆132Nov 30, 2025Updated 3 months ago
- A llama model inference framework implemented in CUDA C++☆63Nov 8, 2024Updated last year
- ☆128Updated this week
- ☆50Mar 14, 2025Updated last year
- LLM Inference via Triton (Flexible & Modular): Focused on Kernel Optimization using CUBIN binaries, Starting from gpt-oss Model☆79Updated this week
- Exploring how optimizations for GEMMs work☆28Feb 28, 2026Updated 3 weeks ago
- Vector math library using RISC-V vector ISA via C intrinsic☆24Jan 14, 2026Updated 2 months ago
- ☆58May 4, 2024Updated last year
- Hand-written CUDA operators and interview guide☆881Aug 23, 2025Updated 7 months ago
- This is the official PyTorch implementation of "BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation."☆40Oct 9, 2025Updated 5 months ago
- An MLIR-based compiler from C/C++ to AMD-Xilinx Versal AIE☆17Aug 5, 2022Updated 3 years ago
- The official implementation of "Sparse-vDiT: Unleashing the Power of Sparse Attention to Accelerate Video Diffusion Transformers" (arXiv …☆51Jun 6, 2025Updated 9 months ago
- ☆71Jan 6, 2025Updated last year
- A JupyterLab extension for displaying dashboards of GPU usage.☆13Aug 24, 2023Updated 2 years ago
- how to optimize some algorithm in cuda.☆2,872Updated this week
- Lambda portfolio☆11Feb 28, 2023Updated 3 years ago
- This repo includes XiangShan's function units☆30Feb 14, 2026Updated last month
- Puzzles for learning Triton, play it with minimal environment configuration!☆647Updated this week
- ☆65Apr 26, 2025Updated 10 months ago
- ☆12Sep 18, 2024Updated last year
- ☆13Aug 15, 2022Updated 3 years ago
- Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruct…☆531Sep 8, 2024Updated last year