📚200+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).
☆66 · Apr 26, 2025 · Updated 10 months ago
Alternatives and similar repositories for CUDA-Learn-Notes
Users that are interested in CUDA-Learn-Notes are comparing it to the libraries listed below
- ☆25 · Apr 7, 2025 · Updated 10 months ago
- TinyML and Efficient Deep Learning Computing ☆19 · Apr 26, 2024 · Updated last year
- A lightweight design for computation-communication overlap. ☆223 · Jan 20, 2026 · Updated last month
- 📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉 ☆9,755 · Updated this week
- 🎬 3.7× faster video generation E2E 🖼️ 1.6× faster image generation E2E ⚡ ColumnSparseAttn 9.3× vs FlashAttn‑3 💨 ColumnSparseGEMM 2.5× … ☆101 · Sep 8, 2025 · Updated 5 months ago
- 🎉CUDA notes / collection of frequently asked interview questions / C++ notes; personal notes, updated sporadically: sgemm, sgemv, warp reduce, block reduce, dot product, elementwise, softmax, layernorm, rmsnorm, hist etc. ☆39 · Jan 25, 2024 · Updated 2 years ago
- NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer ☆163 · Feb 11, 2026 · Updated 2 weeks ago
- Official repository for the paper Local Linear Attention: An Optimal Interpolation of Linear and Softmax Attention For Test-Time Regressi… ☆23 · Oct 1, 2025 · Updated 5 months ago
- Docker cluster configuration for Hadoop ☆11 · Jun 8, 2024 · Updated last year
- ☆33 · Dec 10, 2025 · Updated 2 months ago
- ☆49 · Mar 14, 2025 · Updated 11 months ago
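- A repository for building, from scratch, a dialogue system based on open-source large language models. It includes basic chat, chat with documents, agents, and other features ☆10 · Sep 21, 2024 · Updated last year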
- ☆12 · Sep 18, 2024 · Updated last year
- An MLIR-based compiler from C/C++ to AMD-Xilinx Versal AIE ☆17 · Aug 5, 2022 · Updated 3 years ago
- A very handy TNN classify demo ☆11 · Mar 24, 2021 · Updated 4 years ago
- A student training project for HLS and transformer ☆11 · Oct 19, 2022 · Updated 3 years ago
- Use a YOLOv3 ONNX model to implement object detection ☆11 · Apr 25, 2019 · Updated 6 years ago
- Adds a new backend, Cpu0, to LLVM 17.0.6 ☆12 · Apr 22, 2024 · Updated last year
- ☆11 · Apr 4, 2022 · Updated 3 years ago
- Implementation of "Pixel-In-Pixel Net: Towards Efficient Facial Landmark Detection in the Wild" ☆11 · Jul 6, 2023 · Updated 2 years ago
- [DATE'2025, TCAD'2025] Terafly : A Multi-Node FPGA Based Accelerator Design for Efficient Cooperative Inference in LLMs ☆28 · Nov 13, 2025 · Updated 3 months ago
- ☆13 · Jan 7, 2025 · Updated last year
- ☆14 · Jun 22, 2022 · Updated 3 years ago
- ☆13 · Aug 15, 2022 · Updated 3 years ago
- This repo contains the Assignments from Cornell Tech's ECE 5545 - Machine Learning Hardware and Systems offered in Spring 2023 ☆41 · May 31, 2023 · Updated 2 years ago
- musl libc projects (such as _BSD_SOURCE) ☆18 · Jan 1, 2014 · Updated 12 years ago
- The C++ matting code is based on BackgroundMattingV2 and RobustVideoMatting. ☆11 · Nov 20, 2021 · Updated 4 years ago
- Exploring how optimizations for GEMMs work ☆28 · Jan 1, 2026 · Updated 2 months ago
- A docker image for One Student One Chip's debug exam ☆10 · Sep 22, 2023 · Updated 2 years ago
- The first open source triton inference engine for Stable Diffusion, specifically for sdxl ☆12 · Nov 27, 2023 · Updated 2 years ago
- X/Twitter clone in Axum 0.7.1 ☆13 · Dec 13, 2023 · Updated 2 years ago
- Smart homework-collection assistant ☆10 · Oct 11, 2019 · Updated 6 years ago
- IntelliJ platform plugin for Wavefront OBJ format ☆15 · Updated this week
- Kernel Library Wheel for SGLang ☆16 · Updated this week
- ☕️ A vscode extension for netron, support *.pdmodel, *.nb, *.onnx, *.pb, *.h5, *.tflite, *.pth, *.pt, *.mnn, *.param, etc. ☆14 · Jun 4, 2023 · Updated 2 years ago
- Notes for my Calculus courses in college, written in Jupyter Notebooks ☆12 · Jul 31, 2016 · Updated 9 years ago
- ☆10 · Oct 8, 2021 · Updated 4 years ago
- Dockerized container for MODNet - a Real-Time Portrait Matting solution ☆13 · Mar 27, 2023 · Updated 2 years ago
- The CCMO belongs to the category of multi-objective evolutionary algorithms (MOEAs). CCMO is a powerful algorithm to solve the constraine… ☆13 · Jul 23, 2023 · Updated 2 years ago