Learning about CUDA by writing PTX code.
☆155 · Feb 27, 2024 · Updated 2 years ago
Alternatives and similar repositories for learn-ptx
Users who are interested in learn-ptx are comparing it to the libraries listed below.
- A high-performance attention mechanism that computes softmax normalization in a single streaming pass using running accumulators (online … ☆29 · Oct 11, 2025 · Updated 4 months ago
- Benchmark tests supporting the TiledCUDA library. ☆18 · Nov 19, 2024 · Updated last year
- ☆23 · Jul 11, 2025 · Updated 7 months ago
- Persistent dense gemm for Hopper in `CuTeDSL` ☆15 · Aug 9, 2025 · Updated 6 months ago
- Row-wise block scaling for fp8 quantization matrix multiplication. Solution to GPU mode AMD challenge. ☆17 · Feb 9, 2026 · Updated 3 weeks ago
- ☆16 · Feb 24, 2026 · Updated last week
- CUTLASS and CuTe Examples ☆133 · Nov 30, 2025 · Updated 3 months ago
- Experimental GPU language with meta-programming ☆26 · Sep 6, 2024 · Updated last year
- High-Performance FP32 GEMM on CUDA devices ☆117 · Jan 21, 2025 · Updated last year
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS ☆253 · May 6, 2025 · Updated 10 months ago
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ☆201 · Jun 1, 2025 · Updated 9 months ago
- My submission for the GPUMODE/AMD fp8 mm challenge ☆29 · Jun 4, 2025 · Updated 9 months ago
- Fast low-bit matmul kernels in Triton ☆436 · Feb 1, 2026 · Updated last month
- ☆90 · Dec 16, 2025 · Updated 2 months ago
- Custom PTX Instruction Benchmark ☆139 · Feb 27, 2025 · Updated last year
- Tile primitives for speedy kernels ☆3,202 · Feb 24, 2026 · Updated last week
- CUDA extensions for PyTorch ☆12 · Dec 2, 2025 · Updated 3 months ago
- Speeding Up Your Python Codes 1000x ☆12 · Apr 2, 2025 · Updated 11 months ago
- Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruct… ☆526 · Sep 8, 2024 · Updated last year
- An experimental communicating attention kernel based on DeepEP. ☆35 · Jul 29, 2025 · Updated 7 months ago
- UNet diffusion model in pure CUDA ☆657 · Jun 28, 2024 · Updated last year
- ☆79 · Dec 27, 2024 · Updated last year
- rust-writing-os course of https://rust.os2edu.cn ☆11 · Apr 29, 2022 · Updated 3 years ago
- Utilities for Training Very Large Models ☆58 · Sep 25, 2024 · Updated last year
- Faster PyTorch bitsandbytes 4bit fp4 nn.Linear ops ☆30 · Mar 16, 2024 · Updated last year
- An implementation of a deep learning framework and models in C ☆47 · Apr 1, 2025 · Updated 11 months ago
- NanoGPT (124M) quality in 2.67B tokens ☆28 · Sep 17, 2025 · Updated 5 months ago
- TensaLang is a tensor-first programming language, compiler, and runtime that lets you write the model's inference engine (e.g. LLMs) and s… ☆71 · Feb 20, 2026 · Updated 2 weeks ago
- extensible collectives library in triton ☆96 · Mar 31, 2025 · Updated 11 months ago
- Flexibly track outputs and grad-outputs of torch.nn.Module. ☆13 · Oct 6, 2023 · Updated 2 years ago
- It's a baby compiler. (Lean btw.) ☆16 · May 19, 2025 · Updated 9 months ago
- Flash Attention in ~100 lines of CUDA (forward pass only) ☆1,084 · Dec 30, 2024 · Updated last year
- GPU programming related news and material links ☆2,010 · Sep 17, 2025 · Updated 5 months ago
- IREE's PyTorch Frontend, based on Torch Dynamo. ☆105 · Feb 27, 2026 · Updated last week
- Write a fast kernel and see how you compare against the best humans and AI on gpumode.com ☆77 · Feb 26, 2026 · Updated last week
- CUDA Matrix Multiplication Optimization ☆261 · Jul 19, 2024 · Updated last year
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. ☆106 · Jun 28, 2025 · Updated 8 months ago
- TensorRT encapsulation, learn, rewrite, practice. ☆30 · Oct 19, 2022 · Updated 3 years ago
- A JAX-like function transformation engine, but micro: microjax ☆34 · Oct 25, 2024 · Updated last year