VikParuchuri / triton_tutorial
Tutorials for Triton, a language for writing GPU kernels
☆30 · Updated last year
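For readers landing here first, this is the flavor of code the tutorials teach: a kernel written in Python that Triton JIT-compiles for the GPU. The element-wise vector add below is a minimal illustrative sketch, not code from this repository; it assumes `triton` and `torch` are installed with a CUDA device available, and names like `vector_add_kernel` are invented for the example.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def vector_add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                        # each program instance handles one block
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                        # mask off the ragged tail of the array
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def vector_add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)                     # enough 1024-wide blocks to cover n
    vector_add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

# Usage (CUDA tensors required):
# x = torch.randn(4096, device="cuda"); y = torch.randn(4096, device="cuda")
# assert torch.allclose(vector_add(x, y), x + y)
```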
Alternatives and similar repositories for triton_tutorial
Users interested in triton_tutorial are comparing it to the libraries listed below:
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ☆188 · Updated last month
- ☆162 · Updated last year
- ☆81 · Updated last year
- Load compute kernels from the Hub ☆210 · Updated this week
- Learn CUDA with PyTorch ☆32 · Updated last week
- ☆181 · Updated 6 months ago
- Experiment of using Tangent to autodiff triton ☆79 · Updated last year
- Flash-Muon: An Efficient Implementation of Muon Optimizer ☆144 · Updated last month
- Small scale distributed training of sequential deep learning models, built on Numpy and MPI. ☆136 · Updated last year
- ring-attention experiments ☆145 · Updated 9 months ago
- PTX-Tutorial Written Purely By AIs (OpenAI's Deep Research and Claude 3.7) ☆66 · Updated 4 months ago
- Mixed precision training from scratch with Tensors and CUDA ☆24 · Updated last year
- This repository contains the experimental PyTorch native float8 training UX ☆224 · Updated 11 months ago
- Cataloging released Triton kernels. ☆245 · Updated 6 months ago
- Implementations of attention with the softpick function, naive and FlashAttention-2 ☆80 · Updated 2 months ago
- ☆113 · Updated last year
- Code for studying the super weight in LLM ☆113 · Updated 7 months ago
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS ☆202 · Updated 2 months ago
- Research implementation of Native Sparse Attention (arXiv:2502.11089) ☆58 · Updated 5 months ago
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference. ☆80 · Updated last week
- Tiled Flash Linear Attention library for fast and efficient mLSTM Kernels. ☆64 · Updated 3 weeks ago
- ML/DL Math and Method notes ☆61 · Updated last year
- Easily run PyTorch on multiple GPUs & machines ☆46 · Updated last month
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆47 · Updated this week
- A bunch of kernels that might make stuff slower 😉 ☆56 · Updated last week
- The simplest, fastest repository for training/finetuning medium-sized GPTs. ☆149 · Updated 3 weeks ago
- An implementation of the Llama architecture, to instruct and delight ☆21 · Updated last month
- Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of… ☆138 · Updated 11 months ago
- The evaluation framework for training-free sparse attention in LLMs ☆85 · Updated last month
- train with kittens! ☆61 · Updated 9 months ago