cloneofsimo / ptx-tutorial-by-aislop
PTX-Tutorial Written Purely By AIs (Deep Research of OpenAI and Claude 3.7)
☆66 · Updated 7 months ago
Alternatives and similar repositories for ptx-tutorial-by-aislop
Users interested in ptx-tutorial-by-aislop are comparing it to the libraries listed below.
- Quantized LLM training in pure CUDA/C++. ☆215 · Updated this week
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ☆195 · Updated 5 months ago
- 👷 Build compute kernels ☆171 · Updated this week
- Learning about CUDA by writing PTX code. ☆147 · Updated last year
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆61 · Updated this week
- Learn CUDA with PyTorch ☆104 · Updated last week
- ring-attention experiments ☆155 · Updated last year
- PCCL (Prime Collective Communications Library) implements fault tolerant collective communications over IP ☆138 · Updated 2 months ago
- Load compute kernels from the Hub ☆326 · Updated this week
- How to ensure correctness and ship LLM generated kernels in PyTorch ☆117 · Updated this week
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS ☆242 · Updated 6 months ago
- High-Performance SGEMM on CUDA devices ☆110 · Updated 9 months ago
- NanoGPT-speedrunning for the poor T4 enjoyers ☆72 · Updated 6 months ago
- ☆106 · Updated last week
- making the official triton tutorials actually comprehensible ☆61 · Updated 2 months ago
- A bunch of kernels that might make stuff slower 😉 ☆64 · Updated this week
- ☆218 · Updated 9 months ago
- The evaluation framework for training-free sparse attention in LLMs ☆102 · Updated last month
- Official implementation for Training LLMs with MXFP4 ☆102 · Updated 6 months ago
- ☆89 · Updated last year
- FlexAttention based, minimal vllm-style inference engine for fast Gemma 2 inference. ☆302 · Updated last week
- Small scale distributed training of sequential deep learning models, built on Numpy and MPI. ☆148 · Updated 2 years ago
- ☆177 · Updated last year
- coding CUDA everyday! ☆69 · Updated this week
- Simple & Scalable Pretraining for Neural Architecture Research ☆299 · Updated 2 weeks ago
- NSA Triton Kernels written with GPT5 and Opus 4.1 ☆65 · Updated 3 months ago
- A curated list of resources for learning and exploring Triton, OpenAI's programming language for writing efficient GPU code. ☆428 · Updated 8 months ago
- Fast low-bit matmul kernels in Triton ☆395 · Updated 2 weeks ago
- Explore training for quantized models ☆25 · Updated 4 months ago
- Hand-Rolled GPU communications library ☆58 · Updated this week