gpu-mode / reference-kernels
Reference Kernels for the Leaderboard
☆29 · Updated this week
Alternatives and similar repositories for reference-kernels:
Users who are interested in reference-kernels are comparing it to the libraries listed below:
- High-Performance SGEMM on CUDA devices ☆90 · Updated 3 months ago
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆40 · Updated this week
- extensible collectives library in triton ☆85 · Updated 3 weeks ago
- ☆31 · Updated 3 months ago
- ☆200 · Updated this week
- LLM training in simple, raw C/CUDA ☆92 · Updated 11 months ago
- Fast low-bit matmul kernels in Triton ☆291 · Updated this week
- Small scale distributed training of sequential deep learning models, built on NumPy and MPI. ☆131 · Updated last year
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS ☆161 · Updated last month
- ☆77 · Updated 5 months ago
- ☆97 · Updated last month
- Cataloging released Triton kernels. ☆217 · Updated 3 months ago
- The simplest but fast implementation of matrix multiplication in CUDA. ☆34 · Updated 8 months ago
- Fastest kernels written from scratch ☆236 · Updated 2 weeks ago
- PTX-Tutorial Written Purely By AIs (Deep Research of OpenAI and Claude 3.7) ☆65 · Updated 3 weeks ago
- An experimental CPU backend for Triton (https://github.com/openai/triton) ☆40 · Updated last month
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. ☆81 · Updated last week
- CUDA Matrix Multiplication Optimization ☆179 · Updated 9 months ago
- ring-attention experiments ☆129 · Updated 6 months ago
- ☆87 · Updated last year
- Collection of kernels written in Triton language ☆119 · Updated 2 weeks ago
- ☆27 · Updated 3 months ago
- FlexAttention w/ FlashAttention3 Support ☆26 · Updated 6 months ago
- This repository contains the experimental PyTorch native float8 training UX ☆223 · Updated 8 months ago
- Experimental GPU language with meta-programming ☆22 · Updated 7 months ago
- ☆16 · Updated 6 months ago
- Learning about CUDA by writing PTX code. ☆128 · Updated last year
- ☆51 · Updated last week
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ☆177 · Updated this week
- Experiment of using Tangent to autodiff triton ☆78 · Updated last year