lessw2020 / triton_kernels_for_fun_and_profit

Custom kernels in Triton language for accelerating LLMs

☆26

Alternatives and similar repositories for triton_kernels_for_fun_and_profit

Users who are interested in triton_kernels_for_fun_and_profit are comparing it to the repositories listed below.

gpu-mode / triton-index
Cataloging released Triton kernels.
☆263 Updated last month
siboehm / ShallowSpeed
Small scale distributed training of sequential deep learning models, built on Numpy and MPI.
☆147 Updated 2 years ago
gau-nernst / learn-cuda
Learn CUDA with PyTorch
☆95 Updated last month
gpu-mode / ring-attention
ring-attention experiments
☆155 Updated last year
Deep-Learning-Profiling-Tools / triton-viz
☆246 Updated this week
zinccat / Awesome-Triton-Kernels
Collection of kernels written in Triton language
☆159 Updated 6 months ago
dropbox / gemlite
Fast low-bit matmul kernels in Triton
☆388 Updated last week
gpu-mode / discord-cluster-manager
Write a fast kernel and run it on Discord. See how you compare against the best!
☆58 Updated 2 weeks ago
MekkCyber / CutlassAcademy
A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS
☆233 Updated 5 months ago
open-lm-engine / accelerated-model-architectures
A bunch of kernels that might make stuff slower 😉
☆63 Updated this week
cchan / tccl
extensible collectives library in triton
☆90 Updated 7 months ago
MekkCyber / TritonAcademy
A repository to unravel the language of GPUs, making their kernel conversations easy to understand
☆194 Updated 5 months ago
gpu-mode / profiling-cuda-in-torch
☆174 Updated last year
meta-pytorch / float8_experimental
This repository contains the experimental PyTorch native float8 training UX
☆223 Updated last year
meta-pytorch / applied-ai
Applied AI experiments and examples for PyTorch
☆301 Updated 2 months ago
cloneofsimo / ptx-tutorial-by-aislop
PTX-Tutorial Written Purely By AIs (Deep Research of Openai and Claude 3.7)
☆66 Updated 7 months ago
salykova / sgemm.cu
High-Performance SGEMM on CUDA devices
☆107 Updated 9 months ago
tspeterkim / paged-attention-minimal
a minimal cache manager for PagedAttention, on top of llama3.
☆125 Updated last year
gpu-mode / reference-kernels
Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard!
☆98 Updated 2 weeks ago
evintunador / triton_docs_tutorials
making the official triton tutorials actually comprehensible
☆57 Updated 2 months ago
meta-pytorch / kraken
Triton-based Symmetric Memory operators and examples
☆58 Updated 2 weeks ago
Jokeren / triton-samples
☆28 Updated 9 months ago
andylolu2 / simpleGEMM
The simplest but fast implementation of matrix multiplication in CUDA.
☆39 Updated last year
foundation-model-stack / foundation-model-stack
🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components.
☆216 Updated this week
meta-pytorch / BackendBench
How to ensure correctness and ship LLM generated kernels in PyTorch
☆111 Updated this week
linjames0 / Transformer-CUDA
An implementation of the transformer architecture onto an Nvidia CUDA kernel
☆192 Updated 2 years ago
SzymonOzog / FastSoftmax
Step by step implementation of a fast softmax kernel in CUDA
☆53 Updated 9 months ago
axonn-ai / axonn
Parallel framework for training and fine-tuning deep neural networks
☆65 Updated last week
gau-nernst / quantized-training
Explore training for quantized models
☆25 Updated 3 months ago
meta-pytorch / tritonbench
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
☆272 Updated this week