📚200+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).
☆74Apr 26, 2025Updated 11 months ago
Alternatives and similar repositories for CUDA-Learn-Notes
Users that are interested in CUDA-Learn-Notes are comparing it to the libraries listed below.
- ☆27Apr 7, 2025Updated last year
- TinyML and Efficient Deep Learning Computing☆20Apr 26, 2024Updated last year
- CUDA & Triton Learning Project: Exploring Flash Attention Implementations☆29Aug 14, 2025Updated 7 months ago
- 📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉☆10,217Updated this week
- [CVPR2026] BinaryAttention: One-Bit QK-Attention for Vision and Diffusion Transformers☆31Mar 17, 2026Updated 3 weeks ago
- ☆33Dec 10, 2025Updated 4 months ago
- A lightweight design for computation-communication overlap.☆226Jan 20, 2026Updated 2 months ago
- HIP backend patch for Numba, the NumPy aware dynamic Python compiler using LLVM.☆19Feb 16, 2026Updated last month
- GEMM by WMMA (tensor core)☆15Jul 31, 2022Updated 3 years ago
- ☆10May 21, 2020Updated 5 years ago
- An NVIDIA AI Workbench example project for exploring the RAPIDS cuDF library☆18Oct 7, 2025Updated 6 months ago
- 🎉CUDA notes / collection of frequently asked interview questions / C++ notes; personal notes, updated sporadically: sgemm, sgemv, warp reduce, block reduce, dot product, elementwise, softmax, layernorm, rmsnorm, hist, etc.☆43Jan 25, 2024Updated 2 years ago
- 🎬 3.7× faster video generation E2E 🖼️ 1.6× faster image generation E2E ⚡ ColumnSparseAttn 9.3× vs FlashAttn‑3 💨 ColumnSparseGEMM 2.5× …☆103Sep 8, 2025Updated 7 months ago
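- This repository builds, from scratch, a dialogue system based on open-source large language models. It includes basic chat, chat with documents, agents, and other features.☆10Sep 21, 2024Updated last year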
- Toolkit for launching and observing MaxText training on Slurm-managed GPU clusters☆27Updated this week
- ☆63Feb 15, 2026Updated last month
- Training camp project (training track)☆26Jan 28, 2026Updated 2 months ago
- A llama model inference framework implemented in CUDA C++☆65Nov 8, 2024Updated last year
- CUTLASS and CuTe Examples☆134Nov 30, 2025Updated 4 months ago
- ☆50Mar 14, 2025Updated last year
- This repository is a LaTeX project of a document that follows all the submission requirements for Computers & Geosciences.☆20Jan 8, 2022Updated 4 years ago
- Wan: Open and Advanced Large-Scale Video Generative Models☆29Jul 28, 2025Updated 8 months ago
- ☆23Sep 9, 2024Updated last year
- Ollama RAG using SQL Database☆12Apr 16, 2025Updated 11 months ago
- ☆18Dec 24, 2023Updated 2 years ago
- LLM Inference via Triton (Flexible & Modular): Focused on Kernel Optimization using CUBIN binaries, Starting from gpt-oss Model☆81Updated this week
- Exploring how optimizations for GEMMs work☆30Feb 28, 2026Updated last month
- ☆58May 4, 2024Updated last year
- Hand-written CUDA operators and interview guide☆914Aug 23, 2025Updated 7 months ago
- Practice exercises and assessments for NVIDIA DLI's "Fundamentals of Accelerated Computing with CUDA Python" course.☆30Sep 8, 2023Updated 2 years ago
- This is the official PyTorch implementation of "BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation."☆40Oct 9, 2025Updated 6 months ago
- An MLIR-based compiler from C/C++ to AMD-Xilinx Versal AIE☆17Aug 5, 2022Updated 3 years ago
- The official implementation of "Sparse-vDiT: Unleashing the Power of Sparse Attention to Accelerate Video Diffusion Transformers" (arXiv …☆51Jun 6, 2025Updated 10 months ago
- ☆14Jun 26, 2024Updated last year
- Code for Draft Attention☆101May 22, 2025Updated 10 months ago
- ☆72Jan 6, 2025Updated last year
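- MIT6.S081 lab notes, with the environment set up using Docker + code-server (web-based VS Code) to provide a clean, out-of-the-box lab environment; see the website below for detailed usage instructions☆12Jan 28, 2024Updated 2 years ago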
- how to optimize some algorithm in cuda.☆2,910Apr 1, 2026Updated last week
- Lambda portfolio☆11Feb 28, 2023Updated 3 years ago