wafer-ai/gpu-perf-engineering-resources

A curriculum for learning GPU performance engineering, from scratch to what the frontier AI labs do

☆1,256

Alternatives and similar repositories for gpu-perf-engineering-resources

Users who are interested in gpu-perf-engineering-resources are comparing it to the repositories listed below.

cfregly / ai-performance-engineering
Code, labs, and resources for O'Reilly AI Systems Performance Engineering: GPU optimization, distributed training, inference scaling, and…
☆1,674 · Jul 6, 2026 · Updated 2 weeks ago
goabiaryan / awesome-gpu-engineering
GPU Engineering for AI Systems
☆543 · May 17, 2026 · Updated 2 months ago
gau-nernst / learn-cuda
Learn CUDA with PyTorch
☆352 · Jun 1, 2026 · Updated last month
patrick-toulme / pyptx
A Python DSL to write Nvidia PTX for Hopper and Blackwell in JAX and PyTorch
☆367 · Jul 9, 2026 · Updated last week
dataflowr / gpu_llm_flash-attention
Course on Flash-attention in Triton
☆100 · Feb 9, 2026 · Updated 5 months ago
florianmattana / sass-king
Reverse engineering NVIDIA SASS instruction dictionary, kernel audits and pattern recognition across GPU architectures.
☆311 · May 18, 2026 · Updated 2 months ago
rkinas / triton-resources
A curated list of resources for learning and exploring Triton, OpenAI's programming language for writing efficient GPU code.
☆495 · Mar 10, 2025 · Updated last year
j4orz / ateenysitp
a whirlwind tour to deep learning and deep learning systems
☆81 · Updated this week
wafer-ai / kernel-arena
Public benchmark results from Kernel Arena, a leaderboard for LLM-generated AI accelerator kernels.
☆20 · Mar 11, 2026 · Updated 4 months ago
flashinfer-ai / flashinfer
FlashInfer: Kernel Library for LLM Serving
☆5,988 · Updated this week
sgl-project / mini-sglang
A compact implementation of SGLang, designed to demystify the complexities of modern LLM serving systems.
☆4,607 · May 17, 2026 · Updated 2 months ago
tensara / tensara
Competitive GPU kernel optimization platform.
☆218 · Updated this week
tugot17 / pmpp
Complete solutions to the Programming Massively Parallel Processors Edition 4
☆810 · Jun 18, 2025 · Updated last year
aerlabsAI / ai-inference-resources
Curated collection of AI inference engineering resources — LLM serving, GPU kernels, quantization, distributed inference, and production …
☆245 · Feb 4, 2026 · Updated 5 months ago
gautam1858 / tiny-gpu-compiler
An MLIR-based compiler that takes GPU kernels and compiles them to real hardware instructions. Interactive web visualizer included.
☆139 · Mar 21, 2026 · Updated 4 months ago
Infatoshi / cuda-course
☆3,852 · Mar 11, 2026 · Updated 4 months ago
NVIDIA / TileGym
Helpful kernel tutorials, examples and SKILLs for tile-based GPU programming
☆776 · Updated this week
NVIDIA / accelerated-computing-hub
NVIDIA curated collection of educational resources related to general purpose GPU programming.
☆1,847 · Jul 14, 2026 · Updated last week
gpu-mode / resource-stream
GPU programming related news and material links
☆2,233 · Jun 15, 2026 · Updated last month
gpu-mode / lectures
Material for gpu-mode lectures
☆6,330 · Jun 15, 2026 · Updated last month
dropbox / gemlite
Fast low-bit matmul kernels in Triton
☆477 · Updated this week
facebookresearch / tensor-layouts
A pure-Python implementation of the Nvidia CuTe layout algebra intended to be approachable and easy to learn.
☆231 · Jun 29, 2026 · Updated 3 weeks ago
gau-nernst / gpu-mode-kernels
https://github.com/gpu-mode/reference-kernels
☆26 · Jul 4, 2026 · Updated 2 weeks ago
NVIDIA / cutlass
CUDA Templates and Python DSLs for High-Performance Linear Algebra
☆10,104 · Updated this week
stas00 / the-art-of-debugging
The Art of Debugging Open Book
☆1,667 · Jul 9, 2026 · Updated last week
jax-ml / scaling-book
Home for "How To Scale Your Model", a short blog-style textbook about scaling LLMs on TPUs
☆1,283 · Jul 13, 2026 · Updated last week
anyscale / multimodal-ai
Multimodal AI workloads: batch inference, model training and online serving.
☆111 · Aug 22, 2025 · Updated 10 months ago
meta-pytorch / BackendBench
Ship correct and fast LLM kernels to PyTorch
☆151 · Jan 14, 2026 · Updated 6 months ago
pranjalssh / fast.cu
Fastest kernels written from scratch
☆583 · Sep 18, 2025 · Updated 10 months ago
gpu-mode / awesomeMLSys
An ML Systems Onboarding list
☆1,102 · Feb 19, 2026 · Updated 5 months ago
pytorch / helion
A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.
☆910 · Updated this week
LMCache / LMCache
LMCache: Supercharge Your LLM with the Fastest KV Cache Layer
☆10,732 · Updated this week
ai-dynamo / dynamo
A Datacenter Scale Distributed Inference Serving Framework
☆7,540 · Updated this week
saurabhaloneai / gpu-programming
☆15 · Jan 22, 2026 · Updated 5 months ago
harvard-edge / cs249r_book
Machine Learning Systems
☆27,503 · Updated this week
huggingface / kernels
Build compute kernels and load them from the Hub.
☆715 · Updated this week
mirage-project / mirage
Mirage Persistent Kernel: Compiling LLMs into a MegaKernel
☆2,376 · Updated this week
jmaczan / tiny-vllm
Build your own high performance LLM inference engine in C++ and CUDA - a smaller version of vLLM
☆938 · Jul 2, 2026 · Updated 2 weeks ago
ezyang / cute-interactive
Interactive version of the CuTe layout paper
☆57 · Apr 14, 2026 · Updated 3 months ago