cloneofsimo / ptx-tutorial-by-aislop
PTX-Tutorial Written Purely By AIs (Deep Research of OpenAI and Claude 3.7)
☆66 Updated 4 months ago
Alternatives and similar repositories for ptx-tutorial-by-aislop
Users who are interested in ptx-tutorial-by-aislop are comparing it to the repositories listed below
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ☆188 Updated 2 months ago
- Learning about CUDA by writing PTX code. ☆133 Updated last year
- Small scale distributed training of sequential deep learning models, built on Numpy and MPI. ☆137 Updated last year
- Learn CUDA with PyTorch ☆33 Updated 2 weeks ago
- Load compute kernels from the Hub ☆220 Updated this week
- ring-attention experiments ☆146 Updated 9 months ago
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS ☆205 Updated 2 months ago
- NanoGPT-speedrunning for the poor T4 enjoyers ☆68 Updated 3 months ago
- PCCL (Prime Collective Communications Library) implements fault tolerant collective communications over IP ☆98 Updated 2 weeks ago
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆48 Updated this week
- ☆76 Updated last month
- High-Performance SGEMM on CUDA devices ☆98 Updated 6 months ago
- 👷 Build compute kernels ☆87 Updated this week
- ☆162 Updated last year
- Fast low-bit matmul kernels in Triton ☆338 Updated last week
- Dion optimizer algorithm ☆193 Updated this week
- ☆215 Updated 6 months ago
- A bunch of kernels that might make stuff slower 😉 ☆56 Updated this week
- making the official triton tutorials actually comprehensible ☆53 Updated last week
- A curated list of resources for learning and exploring Triton, OpenAI's programming language for writing efficient GPU code. ☆382 Updated 4 months ago
- ☆227 Updated last week
- train with kittens! ☆61 Updated 9 months ago
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆141 Updated last week
- Flash-Muon: An Efficient Implementation of Muon Optimizer ☆149 Updated last month
- My submission for the GPUMODE/AMD fp8 mm challenge ☆27 Updated 2 months ago
- Simple & Scalable Pretraining for Neural Architecture Research ☆277 Updated last week
- in this repository, i'm going to implement increasingly complex llm inference optimizations ☆64 Updated 2 months ago
- ☆88 Updated last year
- An implementation of the transformer architecture onto an Nvidia CUDA kernel ☆189 Updated last year
- Cataloging released Triton kernels. ☆247 Updated 6 months ago