cloneofsimo / ptx-tutorial-by-aislop
PTX-Tutorial Written Purely By AIs (Deep Research of OpenAI and Claude 3.7)
☆66 Updated 3 months ago
Alternatives and similar repositories for ptx-tutorial-by-aislop
Users that are interested in ptx-tutorial-by-aislop are comparing it to the libraries listed below
- Learning about CUDA by writing PTX code. ☆133 Updated last year
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ☆188 Updated last month
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS ☆195 Updated 2 months ago
- ring-attention experiments ☆144 Updated 8 months ago
- Load compute kernels from the Hub ☆203 Updated this week
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆46 Updated 2 weeks ago
- Small scale distributed training of sequential deep learning models, built on Numpy and MPI. ☆134 Updated last year
- ☆161 Updated last year
- High-Performance SGEMM on CUDA devices ☆97 Updated 5 months ago
- Learn CUDA with PyTorch ☆29 Updated this week
- A curated list of resources for learning and exploring Triton, OpenAI's programming language for writing efficient GPU code. ☆378 Updated 4 months ago
- NanoGPT-speedrunning for the poor T4 enjoyers ☆68 Updated 2 months ago
- making the official triton tutorials actually comprehensible ☆45 Updated 3 months ago
- in this repository, i'm going to implement increasingly complex llm inference optimizations ☆63 Updated last month
- ☆198 Updated 5 months ago
- PCCL (Prime Collective Communications Library) implements fault tolerant collective communications over IP ☆96 Updated last month
- ☆88 Updated last year
- A bunch of kernels that might make stuff slower 😉 ☆54 Updated this week
- Fast low-bit matmul kernels in Triton ☆330 Updated this week
- ☆214 Updated 5 months ago
- ☆71 Updated 2 weeks ago
- Collection of kernels written in Triton language ☆136 Updated 3 months ago
- A really tiny autograd engine ☆94 Updated last month
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ☆128 Updated 7 months ago
- ☆225 Updated this week
- The evaluation framework for training-free sparse attention in LLMs ☆82 Updated 3 weeks ago
- Simple MPI implementation for prototyping or learning ☆259 Updated 2 weeks ago
- SIMD quantization kernels ☆73 Updated last week
- The Automated LLM Speedrunning Benchmark measures how well LLM agents can reproduce previous innovations and discover new ones in languag… ☆87 Updated 2 weeks ago
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆135 Updated this week