lessw2020 / triton_kernels_for_fun_and_profit
Custom kernels in Triton language for accelerating LLMs
☆25 Updated last year
Alternatives and similar repositories for triton_kernels_for_fun_and_profit
Users who are interested in triton_kernels_for_fun_and_profit are comparing it to the libraries listed below.
- Learn CUDA with PyTorch ☆78 Updated this week
- Small scale distributed training of sequential deep learning models, built on Numpy and MPI. ☆142 Updated last year
- Cataloging released Triton kernels. ☆257 Updated last week
- ☆237 Updated this week
- ring-attention experiments ☆152 Updated 11 months ago
- Collection of kernels written in Triton language ☆154 Updated 5 months ago
- Fast low-bit matmul kernels in Triton ☆371 Updated last week
- A bunch of kernels that might make stuff slower 😉 ☆59 Updated this week
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆57 Updated this week
- Applied AI experiments and examples for PyTorch ☆294 Updated 3 weeks ago
- This repository contains the experimental PyTorch native float8 training UX ☆224 Updated last year
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS ☆221 Updated 4 months ago
- extensible collectives library in triton ☆87 Updated 5 months ago
- a minimal cache manager for PagedAttention, on top of llama3. ☆120 Updated last year
- 🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components. ☆212 Updated this week
- ☆171 Updated last year
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ☆193 Updated 3 months ago
- How to ensure correctness and ship LLM generated kernels in PyTorch ☆58 Updated this week
- PTX-Tutorial Written Purely By AIs (Deep Research of Openai and Claude 3.7) ☆66 Updated 5 months ago
- A parallel framework for training deep neural networks ☆63 Updated 6 months ago
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆223 Updated this week
- Triton-based implementation of Sparse Mixture of Experts. ☆239 Updated 3 weeks ago
- The simplest but fast implementation of matrix multiplication in CUDA. ☆38 Updated last year
- ☆199 Updated 8 months ago
- Explore training for quantized models ☆24 Updated 2 months ago
- ☆330 Updated last week
- LLM training in simple, raw C/CUDA ☆104 Updated last year
- Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard! ☆92 Updated last week
- A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate. ☆307 Updated this week
- making the official triton tutorials actually comprehensible ☆54 Updated 3 weeks ago