modal-labs / gpu-glossary
GPU documentation for humans
☆266 · Updated last week
Alternatives and similar repositories for gpu-glossary
Users who are interested in gpu-glossary are comparing it to the repositories listed below.
- Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O ☆497 · Updated last week
- Learning about CUDA by writing PTX code. ☆135 · Updated last year
- CPU inference for the DeepSeek family of large language models in C++ ☆313 · Updated 3 months ago
- ☆118 · Updated 6 months ago
- Simple MPI implementation for prototyping or learning ☆279 · Updated last month
- ☆77 · Updated last month
- Tilus is a tile-level kernel programming language with explicit control over shared memory and registers. ☆353 · Updated this week
- Complete solutions to the Programming Massively Parallel Processors Edition 4 ☆516 · Updated 3 months ago
- NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process com… ☆311 · Updated this week
- TritonParse: A Compiler Tracer, Visualizer, and mini-Reproducer Generator (WIP) for Triton Kernels ☆150 · Updated last week
- Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard! ☆92 · Updated last week
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS ☆221 · Updated 4 months ago
- Notes and exploration code for learning about AI/ML ☆198 · Updated this week
- High-Performance SGEMM on CUDA devices ☆101 · Updated 8 months ago
- A curated list of resources for learning and exploring Triton, OpenAI's programming language for writing efficient GPU code. ☆408 · Updated 6 months ago
- Efficient implementation of DeepSeek Ops (Blockwise FP8 GEMM, MoE, and MLA) for AMD Instinct MI300X ☆69 · Updated last month
- LLM training in simple, raw C/CUDA ☆104 · Updated last year
- PTX-Tutorial Written Purely By AIs (Deep Research of Openai and Claude 3.7) ☆66 · Updated 5 months ago
- AI Tensor Engine for ROCm ☆276 · Updated this week
- Learnings and programs related to CUDA ☆418 · Updated 2 months ago
- Fastest kernels written from scratch ☆346 · Updated this week
- kernels, of the mega variety ☆496 · Updated 3 months ago
- ☆367 · Updated 5 months ago
- Multi-Threaded FP32 Matrix Multiplication on x86 CPUs ☆356 · Updated 5 months ago
- Perplexity GPU Kernels ☆461 · Updated last month
- CUDA tutorials for Maths & ML tutorials with examples, covers multi-gpus, fused attention, winograd convolution, reinforcement learning. ☆191 · Updated 3 months ago
- An experimental CPU backend for Triton ☆153 · Updated 3 months ago
- A minimal tensor processing unit (TPU), inspired by Google's TPU V2 and V1 ☆923 · Updated last month
- Static suckless single batch CUDA-only qwen3-0.6B mini inference engine ☆468 · Updated last week
- Recreating PyTorch from scratch (C/C++, CUDA, NCCL and Python, with multi-GPU support and automatic differentiation!) ☆159 · Updated last year