cloneofsimo / ptx-tutorial-by-aislop
PTX Tutorial Written Purely by AIs (OpenAI Deep Research and Claude 3.7)
☆60 · Updated this week
Alternatives and similar repositories for ptx-tutorial-by-aislop — users interested in this repository are also comparing the projects listed below:
- High-performance SGEMM on CUDA devices ☆87 · Updated 2 months ago
- Learning about CUDA by writing PTX code ☆124 · Updated last year
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS ☆153 · Updated this week
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ☆167 · Updated last week
- Write a fast kernel and run it on Discord; see how you compare against the best! ☆34 · Updated this week
- Ring-attention experiments ☆128 · Updated 5 months ago
- Load compute kernels from the Hub ☆99 · Updated this week
- Collection of autoregressive model implementations ☆83 · Updated last month
- An open-source reproduction of NVIDIA's nGPT (Normalized Transformer with Representation Learning on the Hypersphere) ☆91 · Updated 3 weeks ago
- Fast matrix multiplications for lookup-table-quantized LLMs ☆235 · Updated last month
- Collection of kernels written in the Triton language ☆114 · Updated last month
- Extensible collectives library in Triton ☆84 · Updated 6 months ago
- Cataloging released Triton kernels ☆208 · Updated 2 months ago
- NanoGPT (124M) quality in 2.67B tokens ☆28 · Updated last month
- Fast low-bit matmul kernels in Triton ☆272 · Updated this week
- Small-scale distributed training of sequential deep learning models, built on NumPy and MPI ☆127 · Updated last year
- Make Triton easier ☆47 · Updated 9 months ago
- ☆203 · Updated 2 months ago
- PyTorch from scratch in pure C/CUDA and Python ☆40 · Updated 5 months ago
- ☆152 · Updated last year
- Boosting 4-bit inference kernels with 2:4 sparsity ☆71 · Updated 6 months ago
- Research implementation of Native Sparse Attention (arXiv:2502.11089) ☆54 · Updated last month
- KernelBench: Can LLMs write GPU kernels? A benchmark of Torch → CUDA problems ☆237 · Updated last week
- Learn CUDA with PyTorch ☆19 · Updated last month
- LLM training in simple, raw C/CUDA ☆92 · Updated 10 months ago
- ☆192 · Updated this week
- ☆87 · Updated last year
- Experiment of using Tangent to autodiff Triton ☆78 · Updated last year
- Cray-LM unified training and inference stack ☆21 · Updated last month
- Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization, in pure C ☆21 · Updated 8 months ago