cloneofsimo / ptx-tutorial-by-aislop
PTX-Tutorial Written Purely By AIs (Deep Research by OpenAI and Claude 3.7)
★66 · Updated 6 months ago
Alternatives and similar repositories for ptx-tutorial-by-aislop
Users who are interested in ptx-tutorial-by-aislop are comparing it to the libraries listed below.
- Quantized LLM training in pure CUDA/C++. ★32 · Updated last week
- Build compute kernels ★149 · Updated this week
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ★193 · Updated 4 months ago
- NanoGPT-speedrunning for the poor T4 enjoyers ★72 · Updated 5 months ago
- Learning about CUDA by writing PTX code. ★137 · Updated last year
- Learn CUDA with PyTorch ★84 · Updated last week
- Write a fast kernel and run it on Discord. See how you compare against the best! ★57 · Updated last week
- PCCL (Prime Collective Communications Library) implements fault-tolerant collective communications over IP ★123 · Updated 3 weeks ago
- Small-scale distributed training of sequential deep learning models, built on NumPy and MPI. ★143 · Updated last year
- train with kittens! ★62 · Updated 11 months ago
- ring-attention experiments ★152 · Updated 11 months ago
- ★98 · Updated last month
- How to ensure correctness and ship LLM-generated kernels in PyTorch ★66 · Updated this week
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS ★226 · Updated 4 months ago
- ★173 · Updated last year
- High-Performance SGEMM on CUDA devices ★105 · Updated 8 months ago
- The Automated LLM Speedrunning Benchmark measures how well LLM agents can reproduce previous innovations and discover new ones in languag… ★99 · Updated 2 months ago
- Official implementation for Training LLMs with MXFP4 ★93 · Updated 5 months ago
- Working implementation of DeepSeek MLA ★44 · Updated 8 months ago
- Making the official Triton tutorials actually comprehensible ★54 · Updated last month
- H-Net Dynamic Hierarchical Architecture ★80 · Updated 3 weeks ago
- Load compute kernels from the Hub ★290 · Updated last week
- ★89 · Updated last year
- NSA Triton kernels written with GPT-5 and Opus 4.1 ★65 · Updated last month
- In this repository, I'm going to implement increasingly complex LLM inference optimizations ★68 · Updated 4 months ago
- ★217 · Updated 8 months ago
- Fast low-bit matmul kernels in Triton ★376 · Updated last week
- An open-source reproduction of NVIDIA's nGPT (Normalized Transformer with Representation Learning on the Hypersphere) ★105 · Updated 6 months ago
- Flash-Muon: An Efficient Implementation of the Muon Optimizer ★189 · Updated 3 months ago
- A really tiny autograd engine ★95 · Updated 4 months ago