gpu-mode / lectures
Material for gpu-mode lectures
☆2,967 · Updated this week
Related projects
Alternatives and complementary repositories for lectures
- GPU programming related news and material links ☆1,208 · Updated last month
- Puzzles for learning Triton ☆1,089 · Updated last month
- Tile primitives for speedy kernels ☆1,643 · Updated this week
- 🎉 Modern CUDA Learn Notes with PyTorch: CUDA Cores, Tensor Cores, fp32/tf32, fp16/bf16, fp8/int8, flash_attn, rope, sgemm, hgemm, sgemv,… ☆1,403 · Updated this week
- An ML Systems Onboarding list ☆540 · Updated 3 months ago
- A native PyTorch Library for large model training ☆2,586 · Updated last week
- UNet diffusion model in pure CUDA ☆567 · Updated 4 months ago
- FlashInfer: Kernel Library for LLM Serving ☆1,399 · Updated this week
- how to optimize some algorithm in cuda. ☆1,576 · Updated this week
- Flash Attention in ~100 lines of CUDA (forward pass only) ☆615 · Updated 7 months ago
- Training materials associated with NVIDIA's CUDA Training Series (www.olcf.ornl.gov/cuda-training-series/) ☆604 · Updated 2 months ago
- Solve puzzles. Improve your pytorch. ☆3,267 · Updated 3 months ago
- Efficient implementations of state-of-the-art linear attention models in Pytorch and Triton ☆1,325 · Updated this week
- ☆515 · Updated 2 weeks ago
- PyTorch native finetuning library ☆4,283 · Updated this week
- Simple and efficient pytorch-native transformer text generation in <1000 LOC of python. ☆5,661 · Updated 3 weeks ago
- Make PyTorch models up to 40% faster! Thunder is a source to source compiler for PyTorch. It enables using different hardware executors a… ☆1,190 · Updated this week
- A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs… ☆1,955 · Updated this week
- 📖A curated list of Awesome LLM Inference Paper with codes, TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batc… ☆2,795 · Updated last week
- PyTorch native quantization and sparsity for training and inference ☆1,549 · Updated this week
- nanoGPT style version of Llama 3.1 ☆1,231 · Updated 3 months ago
- Schedule-Free Optimization in PyTorch ☆1,889 · Updated this week
- The full minitorch student suite. ☆1,912 · Updated 2 months ago
- Efficient Triton Kernels for LLM Training ☆3,401 · Updated this week
- Mirage: Automatically Generating Fast GPU Kernels without Programming in Triton/CUDA ☆601 · Updated last week
- Slides, notes, and materials for the workshop ☆305 · Updated 5 months ago
- depyf is a tool to help you understand and adapt to PyTorch compiler torch.compile. ☆499 · Updated last week
- Fast CUDA matrix multiplication from scratch ☆471 · Updated 10 months ago
- TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization, pruning, distillati… ☆536 · Updated this week
- NanoGPT (124M) quality in 7.8 8xH100-minutes ☆965 · Updated this week