stanford-cs149 / cs149gpt
☆67 Updated last year
Alternatives and similar repositories for cs149gpt:
Users who are interested in cs149gpt are comparing it to the repositories listed below.
- Cataloging released Triton kernels. ☆204 Updated 2 months ago
- Small scale distributed training of sequential deep learning models, built on Numpy and MPI. ☆127 Updated last year
- ☆191 Updated this week
- Fast low-bit matmul kernels in Triton ☆267 Updated this week
- Fastest kernels written from scratch ☆199 Updated 2 weeks ago
- Examples and exercises from the book Programming Massively Parallel Processors - A Hands-on Approach. David B. Kirk and Wen-mei W. Hwu (T… ☆65 Updated 4 years ago
- KernelBench: Can LLMs Write GPU Kernels? - Benchmark with Torch -> CUDA problems ☆234 Updated this week
- Stanford CS149 -- Assignment 1 ☆90 Updated 5 months ago
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆34 Updated this week
- CUDA Matrix Multiplication Optimization ☆173 Updated 8 months ago
- ☆136 Updated 2 months ago
- ☆191 Updated 8 months ago
- extensible collectives library in triton ☆84 Updated 6 months ago
- ☆73 Updated 4 months ago
- Applied AI experiments and examples for PyTorch ☆249 Updated this week
- ☆87 Updated 2 weeks ago
- ☆55 Updated 2 months ago
- a minimal cache manager for PagedAttention, on top of llama3. ☆73 Updated 6 months ago
- ☆151 Updated last year
- Collection of kernels written in Triton language ☆114 Updated last month
- ring-attention experiments ☆127 Updated 5 months ago
- Custom kernels in Triton language for accelerating LLMs ☆18 Updated 11 months ago
- Step-by-step optimization of CUDA SGEMM ☆294 Updated 2 years ago
- High-speed GEMV kernels, at most 2.7x speedup compared to pytorch baseline. ☆101 Updated 8 months ago
- An experimental CPU backend for Triton ☆100 Updated last week
- Learning about CUDA by writing PTX code. ☆124 Updated last year
- An implementation of the transformer architecture onto an Nvidia CUDA kernel ☆174 Updated last year