ai-performance-engineering ☆1,135 (updated Feb 28, 2026)
Alternatives and similar repositories for ai-performance-engineering
Users interested in ai-performance-engineering are comparing it to the repositories listed below.
- Material for gpu-mode lectures (☆5,800, updated Feb 1, 2026)
- ArcticInference: vLLM plugin for high-throughput, low-latency inference (☆403, updated Feb 24, 2026)
- GPU programming related news and material links (☆2,010, updated Sep 17, 2025)
- Ship correct and fast LLM kernels to PyTorch (☆144, updated Jan 14, 2026)
- A curriculum for learning GPU performance engineering, from scratch to what the frontier AI labs do (☆429, updated Jan 13, 2026)
- A tool that facilitates debugging convergence issues and testing new algorithms and recipes for training LLMs using NVIDIA libraries such as… (☆18, updated Sep 17, 2025)
- A domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels (☆5,284, updated this week)
- 100 days of building GPU kernels! (☆573, updated Apr 27, 2025)
- Learn GPU Programming in Mojo🔥 by Solving Puzzles (☆299, updated Feb 18, 2026)
- NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process com… (☆469, updated this week)
- Machine Learning algorithms in pure Mojo 🔥 (☆65, updated Feb 24, 2026)
- The Art of Debugging Open Book (☆1,304, updated this week)
- torchcomms: a modern PyTorch communications API (☆344, updated this week)
- Minimalistic 4D-parallelism distributed training framework for education purposes (☆2,099, updated Aug 26, 2025)
- Sample code using NVSHMEM on multiple GPUs (☆30, updated Jan 22, 2023)
- A Quirky Assortment of CuTe Kernels (☆838, updated this week)
- Data Labeling in Machine Learning with Python, by Packt Publishing (☆23, updated Feb 5, 2026)
- Cray-LM unified training and inference stack (☆22, updated Jan 30, 2025)
- GPU Kernels (☆221, updated Apr 27, 2025)
- Distributed compiler based on Triton for parallel systems (☆1,371, updated Feb 13, 2026)
- PyTorch native quantization and sparsity for training and inference (☆2,707, updated this week)
- CUDA templates and Python DSLs for high-performance linear algebra (☆9,348, updated this week)
- Official repository of Sparse ISO-FLOP Transformations for Maximizing Training Efficiency (☆25, updated Jul 31, 2024)
- AMD RAD's multi-GPU Triton-based framework for seamless multi-GPU programming (☆177, updated this week)
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand (☆201, updated Jun 1, 2025)
- A simple API to use CUPTI (☆11, updated Aug 19, 2025)
- Exploring how optimizations for GEMMs work (☆28, updated Jan 1, 2026)
- My study notes and hands-on projects for CUDA-based GPU programming (☆10, updated Dec 11, 2025)
- Simple Audio Embedding Toolkit (☆12, updated Aug 9, 2025)
- Distributed SDDMM Kernel (☆12, updated Jul 8, 2022)
- Persistent dense GEMM for Hopper in `CuTeDSL` (☆15, updated Aug 9, 2025)
- A Datacenter Scale Distributed Inference Serving Framework (☆6,154, updated this week)
- Mirage Persistent Kernel: Compiling LLMs into a MegaKernel (☆2,145, updated Feb 23, 2026)