A curriculum for learning about GPU performance engineering, from scratch to what the frontier AI labs do
☆857 · Apr 27, 2026 · Updated 2 months ago
Alternatives and similar repositories for gpu-perf-engineering-resources
Users that are interested in gpu-perf-engineering-resources are comparing it to the libraries listed below.
- Intel Gaudi's Megatron DeepSpeed Large Language Models for training ☆18 · Dec 19, 2024 · Updated last year
- Code snippets and reproductions from JustAByte ☆48 · Apr 6, 2026 · Updated 2 months ago
- [PACT'24] GraNNDis. A fast and unified distributed graph neural network (GNN) training framework for both full-batch (full-graph) and min… ☆10 · Aug 13, 2024 · Updated last year
- Learn CUDA with PyTorch ☆339 · Jun 1, 2026 · Updated 3 weeks ago
- ☆98 · May 30, 2026 · Updated last month
- FlashInfer: Kernel Library for LLM Serving ☆5,867 · Updated this week
- Code, labs, and resources for O'Reilly AI Systems Performance Engineering: GPU optimization, distributed training, inference scaling, and… ☆1,614 · Updated this week
- Numerical experiments on Jacobi SVD algorithm ☆10 · Jun 3, 2018 · Updated 8 years ago
- torchcomms: a modern PyTorch communications API ☆373 · Updated this week
- ☆12 · May 8, 2025 · Updated last year
- Write a fast kernel and see how you compare against the best humans and AI on gpumode.com ☆100 · Updated this week
- GPU programming related news and material links ☆2,206 · Jun 15, 2026 · Updated 2 weeks ago
- ☆46 · Dec 20, 2023 · Updated 2 years ago
- ☆14 · Apr 24, 2024 · Updated 2 years ago
- ☆55 · Aug 27, 2024 · Updated last year
- TritonParse: A Compiler Tracer, Visualizer, and Reproducer for Triton Kernels ☆212 · Jun 10, 2026 · Updated 3 weeks ago
- A PyTorch native library for training speculative decoding models ☆174 · Updated this week
- A Quadratic Unconstrained Binary Optimization (QUBO) solver library using quantum and classical approaches ☆25 · Jun 24, 2026 · Updated last week
- Step-by-step optimization of CUDA SGEMM ☆477 · Mar 30, 2022 · Updated 4 years ago
- Personal solutions to the Triton Puzzles ☆21 · Jul 18, 2024 · Updated last year
- [ICML2025] LoRA fine-tune directly on the INT4 models. ☆41 · Nov 25, 2024 · Updated last year
- Optimized Utilities for PyTorch ☆26 · Jan 19, 2020 · Updated 6 years ago
- General Matrix Multiplication using NVIDIA Tensor Cores ☆28 · Jan 25, 2025 · Updated last year
- Codebase for ICML'24 paper: Learning from Students: Applying t-Distributions to Explore Accurate and Efficient Formats for LLMs ☆27 · Jun 25, 2024 · Updated 2 years ago
- C++ Green threads and coroutines library ☆38 · Aug 20, 2014 · Updated 11 years ago
- ☆155 · Mar 31, 2026 · Updated 3 months ago
- High-performance FlashAttention-2 for AMD, Intel, and Apple GPUs. Drop-in replacement for PyTorch SDPA. Triton backend for ROCm (MI300X, … ☆158 · Jan 27, 2026 · Updated 5 months ago
- ☆25 · Jun 26, 2024 · Updated 2 years ago
- Minimalistic 4D-parallelism distributed training framework for education purpose ☆2,228 · Aug 26, 2025 · Updated 10 months ago
- ArcticInference: vLLM plugin for high-throughput, low-latency inference ☆452 · Updated this week
- 🤖FFPA: Extends FlashAttention-2 via Split-D for large headdims, 1.5x~3×↑🎉 vs SDPA, up to 430T🎉 on H200. ☆310 · Jun 22, 2026 · Updated last week
- Experimental plugin for scikit-learn to be able to run (some estimators) on Intel GPUs via numba-dpex. ☆17 · Feb 28, 2024 · Updated 2 years ago
- ☆20 · Nov 7, 2019 · Updated 6 years ago
- 📚 A curated list of awesome matrix-matrix multiplication (A * B = C) frameworks, libraries and software ☆65 · Feb 23, 2025 · Updated last year
- ☆21 · Jun 6, 2024 · Updated 2 years ago
- Synthesizer for optimal collective communication algorithms ☆125 · Apr 8, 2024 · Updated 2 years ago
- Tile primitives for speedy kernels ☆3,497 · Jun 15, 2026 · Updated 2 weeks ago
- Benchmark computing Black Scholes formula using different technologies ☆14 · Jun 5, 2025 · Updated last year
- Where is __pyx_capi__? ☆14 · Feb 15, 2018 · Updated 8 years ago