gpu-mode / lectures
Material for gpu-mode lectures
★4,501 · Updated 3 months ago
Alternatives and similar repositories for lectures
Users interested in lectures are comparing it to the libraries listed below.
- GPU programming related news and material links ★1,527 · Updated 4 months ago
- LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners, 200+ CUDA/Tensor Cores Kernels, HGEMM, FA-2 MMA, etc. ★4,549 · Updated this week
- Puzzles for learning Triton ★1,658 · Updated 6 months ago
- How to optimize some algorithms in CUDA. ★2,228 · Updated last week
- Tile primitives for speedy kernels ★2,399 · Updated last week
- FlashInfer: Kernel Library for LLM Serving ★3,088 · Updated this week
- An ML Systems Onboarding list ★789 · Updated 4 months ago
- A curated list of Awesome LLM/VLM Inference Papers with codes: WINT8/4, FlashAttention, PagedAttention, Parallelism, MLA, etc. ★4,064 · Updated last week
- Efficient implementations of state-of-the-art linear attention models in Torch and Triton ★2,438 · Updated last week
- My learning notes/codes for ML SYS. ★2,337 · Updated this week
- Training materials associated with NVIDIA's CUDA Training Series (www.olcf.ornl.gov/cuda-training-series/) ★775 · Updated 9 months ago
- Minimalistic 4D-parallelism distributed training framework for education purpose ★1,505 · Updated 2 months ago
- Large Language Model (LLM) Systems Paper List ★1,246 · Updated last week
- CUDA Templates for Linear Algebra Subroutines ★7,603 · Updated this week
- Domain-specific language designed to streamline the development of high-performance GPU/CPU/Accelerators kernels ★1,225 · Updated this week
- A PyTorch native platform for training generative AI models ★3,868 · Updated this week
- Flash Attention in ~100 lines of CUDA (forward pass only) ★827 · Updated 5 months ago
- Fast CUDA matrix multiplication from scratch ★730 · Updated last year
- This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several… ★1,054 · Updated last year
- PyTorch native quantization and sparsity for training and inference ★2,072 · Updated this week
- A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper, Ada and Bla… ★2,450 · Updated this week
- What would you do with 1000 H100s... ★1,048 · Updated last year
- [MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration ★3,041 · Updated 3 weeks ago
- ★1,148 · Updated last month
- Building blocks for foundation models. ★500 · Updated last year
- Mirage: Automatically Generating Fast GPU Kernels without Programming in Triton/CUDA ★850 · Updated this week
- Efficient Triton Kernels for LLM Training ★5,120 · Updated this week
- The full minitorch student suite. ★2,081 · Updated 9 months ago
- NanoGPT (124M) in 3 minutes ★2,600 · Updated last week
- Hand-written CUDA operators and interview guide ★380 · Updated 4 months ago