A curriculum for learning about GPU performance engineering, from scratch to what the frontier AI labs do
☆804Apr 27, 2026Updated last month
Alternatives and similar repositories for gpu-perf-engineering-resources
Users who are interested in gpu-perf-engineering-resources are comparing it to the libraries listed below.
- Intel Gaudi's Megatron DeepSpeed Large Language Models for training☆18Dec 19, 2024Updated last year
- Course on Flash-attention in Triton☆99Feb 9, 2026Updated 4 months ago
- ☆34Mar 12, 2026Updated 2 months ago
- Code snippets and reproductions from JustAByte☆48Apr 6, 2026Updated 2 months ago
- [PACT'24] GraNNDis. A fast and unified distributed graph neural network (GNN) training framework for both full-batch (full-graph) and min…☆10Aug 13, 2024Updated last year
- Learn CUDA with PyTorch☆321Jun 1, 2026Updated last week
- Row-wise block scaling for fp8 quantization matrix multiplication. Solution to GPU mode AMD challenge.☆19Feb 9, 2026Updated 4 months ago
- [DATE 2023] Pipe-BD: Pipelined Parallel Blockwise Distillation☆12Jul 13, 2023Updated 2 years ago
- ☆96May 30, 2026Updated last week
- FlashInfer: Kernel Library for LLM Serving☆5,760Updated this week
- ☆1,548Mar 31, 2026Updated 2 months ago
- torchcomms: a modern PyTorch communications API☆368Updated this week
- Numerical experiments on Jacobi SVD algorithm☆10Jun 3, 2018Updated 8 years ago
- everything i know about cuda and triton☆13Jan 28, 2025Updated last year
- ☆12May 8, 2025Updated last year
- Write a fast kernel and see how you compare against the best humans and AI on gpumode.com☆98May 8, 2026Updated last month
- GPU programming related news and material links☆2,162Mar 8, 2026Updated 3 months ago
- A high-performance acceleration library dedicated to large-scale model training on AMD GPUs☆64Jun 4, 2026Updated last week
- ☆46Dec 20, 2023Updated 2 years ago
- A PyTorch native library for training speculative decoding models☆154Updated this week
- Study Group of Deep Learning Compiler☆170Jan 15, 2023Updated 3 years ago
- ☆14Apr 24, 2024Updated 2 years ago
- ☆54Aug 27, 2024Updated last year
- TritonParse: A Compiler Tracer, Visualizer, and Reproducer for Triton Kernels☆210Jun 3, 2026Updated last week
- Inference Llama/Llama2/Llama3 Models in NumPy☆20Nov 22, 2023Updated 2 years ago
- Step-by-step optimization of CUDA SGEMM☆471Mar 30, 2022Updated 4 years ago
- Personal solutions to the Triton Puzzles☆21Jul 18, 2024Updated last year
- [ICML2025] LoRA fine-tune directly on the INT4 models.☆41Nov 25, 2024Updated last year
- C++ Green threads and coroutines library☆38Aug 20, 2014Updated 11 years ago
- High-performance FlashAttention-2 for AMD, Intel, and Apple GPUs. Drop-in replacement for PyTorch SDPA. Triton backend for ROCm (MI300X, …☆157Jan 27, 2026Updated 4 months ago
- ☆25Jun 26, 2024Updated last year
- This is a repository for the course "Agentic AI Engineering" by Towards AI.☆208Mar 2, 2026Updated 3 months ago
- AMD RAD's multi-GPU Triton-based framework for seamless multi-GPU programming☆188Updated this week
- Minimalistic 4D-parallelism distributed training framework for educational purposes☆2,216Aug 26, 2025Updated 9 months ago
- ArcticInference: vLLM plugin for high-throughput, low-latency inference☆445Apr 23, 2026Updated last month
- Building blocks for foundation models.☆626Jan 3, 2024Updated 2 years ago
- 🤖FFPA: Extends FlashAttention-2 via Split-D for large headdims, 1.5x~3×↑🎉 vs SDPA, up to 430T🎉 on H200.☆305Updated this week
- Experimental plugin for scikit-learn to be able to run (some estimators) on Intel GPUs via numba-dpex.☆17Feb 28, 2024Updated 2 years ago
- 📚 A curated list of awesome matrix-matrix multiplication (A * B = C) frameworks, libraries and software☆65Feb 23, 2025Updated last year