cloneofsimo / ptx-tutorial-by-aislop
PTX-Tutorial Written Purely By AIs (Deep Research of OpenAI and Claude 3.7)
☆67 · Updated 3 months ago
Alternatives and similar repositories for ptx-tutorial-by-aislop
Users who are interested in ptx-tutorial-by-aislop are comparing it to the libraries listed below.
- Learning about CUDA by writing PTX code. ☆132 · Updated last year
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ☆185 · Updated 3 weeks ago
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆46 · Updated this week
- NanoGPT-speedrunning for the poor T4 enjoyers ☆66 · Updated 2 months ago
- High-Performance SGEMM on CUDA devices ☆95 · Updated 5 months ago
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS ☆189 · Updated last month
- ☆65 · Updated 2 weeks ago
- Experimental GPU language with meta-programming ☆23 · Updated 9 months ago
- making the official triton tutorials actually comprehensible ☆41 · Updated 3 months ago
- Collection of autoregressive model implementation ☆85 · Updated 2 months ago
- Small scale distributed training of sequential deep learning models, built on Numpy and MPI. ☆134 · Updated last year
- A really tiny autograd engine ☆94 · Updated 3 weeks ago
- Load compute kernels from the Hub ☆191 · Updated this week
- train with kittens! ☆59 · Updated 7 months ago
- ☆159 · Updated last year
- Make triton easier ☆46 · Updated last year
- pytorch from scratch in pure C/CUDA and python ☆40 · Updated 8 months ago
- ☆213 · Updated 5 months ago
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ☆126 · Updated 6 months ago
- PCCL (Prime Collective Communications Library) implements fault tolerant collective communications over IP ☆94 · Updated last month
- Fast low-bit matmul kernels in Triton ☆322 · Updated this week
- Learn CUDA with PyTorch ☆27 · Updated this week
- Flash-Muon: An Efficient Implementation of Muon Optimizer ☆131 · Updated last week
- in this repository, i'm going to implement increasingly complex llm inference optimizations ☆60 · Updated last month
- ring-attention experiments ☆144 · Updated 8 months ago
- Custom triton kernels for training Karpathy's nanoGPT. ☆19 · Updated 8 months ago
- LLM training in simple, raw C/CUDA ☆99 · Updated last year
- extensible collectives library in triton ☆86 · Updated 2 months ago
- TritonParse is a tool designed to help developers analyze and debug Triton kernels by visualizing the compilation process and source code… ☆93 · Updated this week
- ☆13 · Updated last year