thevasudevgupta / gpt-triton
Triton implementation of GPT/LLaMA
☆18 · Updated 9 months ago
Alternatives and similar repositories for gpt-triton
Users interested in gpt-triton are comparing it to the repositories listed below.
- Small-scale distributed training of sequential deep learning models, built on NumPy and MPI. ☆133 · Updated last year
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand. ☆184 · Updated last week
- ☆157 · Updated last year
- ☆168 · Updated 5 months ago
- Ring-attention experiments. ☆143 · Updated 7 months ago
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS. ☆181 · Updated 3 weeks ago
- Load compute kernels from the Hub. ☆139 · Updated this week
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆44 · Updated this week
- Collection of autoregressive model implementations. ☆85 · Updated last month
- This repository contains the experimental PyTorch-native float8 training UX. ☆223 · Updated 10 months ago
- A bunch of kernels that might make stuff slower 😉 ☆46 · Updated this week
- Mixed-precision training from scratch with Tensors and CUDA. ☆23 · Updated last year
- ☆188 · Updated 3 months ago
- Making the official Triton tutorials actually comprehensible. ☆34 · Updated 2 months ago
- Collection of kernels written in the Triton language. ☆125 · Updated last month
- Fast low-bit matmul kernels in Triton. ☆303 · Updated last week
- ☆78 · Updated 10 months ago
- ViT inference in Triton because, why not? ☆28 · Updated last year
- Cataloging released Triton kernels. ☆226 · Updated 4 months ago
- Simple and efficient PyTorch-native transformer training and inference (batched). ☆75 · Updated last year
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ☆249 · Updated this week
- 🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components. ☆196 · Updated this week
- An implementation of the transformer architecture as an Nvidia CUDA kernel. ☆183 · Updated last year
- NanoGPT speedrunning for the poor T4 enjoyers. ☆66 · Updated last month
- ☆108 · Updated last year
- ☆88 · Updated last year
- PTX tutorial written purely by AIs (OpenAI Deep Research and Claude 3.7). ☆67 · Updated 2 months ago
- Triton-based implementation of Sparse Mixture of Experts. ☆216 · Updated 6 months ago
- ☆210 · Updated this week
- Tree Attention: Topology-aware decoding for long-context attention on GPU clusters. ☆126 · Updated 6 months ago