gpu-mode / lectures

Material for gpu-mode lectures

☆4,718

Alternatives and similar repositories for lectures

Users who are interested in lectures are comparing it to the repositories listed below

gpu-mode / resource-stream
GPU programming related news and material links
☆1,616 Updated 6 months ago
srush / Triton-Puzzles
Puzzles for learning Triton
☆1,747 Updated 7 months ago
xlite-dev / LeetCUDA
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
☆5,430 Updated 2 weeks ago
flashinfer-ai / flashinfer
FlashInfer: Kernel Library for LLM Serving
☆3,349 Updated this week
gpu-mode / awesomeMLSys
An ML Systems Onboarding list
☆836 Updated 5 months ago
HazyResearch / ThunderKittens
Tile primitives for speedy kernels
☆2,517 Updated this week
BBuf / how-to-optim-algorithm-in-cuda
How to optimize some algorithms in CUDA.
☆2,309 Updated last week
xlite-dev / Awesome-LLM-Inference
📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉
☆4,217 Updated this week
zhaochenyang20 / Awesome-ML-SYS-Tutorial
My learning notes/codes for ML SYS.
☆2,854 Updated this week
huggingface / picotron
Minimalistic 4D-parallelism distributed training framework for educational purposes
☆1,588 Updated last week
Infatoshi / cuda-course
☆1,261 Updated 2 weeks ago
olcf / cuda-training-series
Training materials associated with NVIDIA's CUDA Training Series (www.olcf.ornl.gov/cuda-training-series/)
☆813 Updated 10 months ago
tspeterkim / flash-attention-minimal
Flash Attention in ~100 lines of CUDA (forward pass only)
☆867 Updated 6 months ago
NVIDIA / TransformerEngine
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper, Ada and Bla…
☆2,548 Updated this week
tile-ai / tilelang
Domain-specific language designed to streamline the development of high-performance GPU/CPU/Accelerators kernels
☆1,391 Updated this week
mirage-project / mirage
Mirage: Automatically Generating Fast GPU Kernels without Programming in Triton/CUDA
☆1,540 Updated last week
siboehm / SGEMM_CUDA
Fast CUDA matrix multiplication from scratch
☆764 Updated last year
pytorch / torchtitan
A PyTorch native platform for training generative AI models
☆4,032 Updated this week
fla-org / flash-linear-attention
🚀 Efficient implementations of state-of-the-art linear attention models
☆2,900 Updated this week
AmberLJC / LLMSys-PaperList
Large Language Model (LLM) Systems Paper List
☆1,362 Updated this week
KellerJordan / modded-nanogpt
NanoGPT (124M) in 3 minutes
☆2,774 Updated 3 weeks ago
minitorch / minitorch
The full minitorch student suite.
☆2,129 Updated 10 months ago
PaddleJitLab / CUDATutorial
A self-learning tutorial for CUDA High Performance Programming.
☆674 Updated 2 weeks ago
pytorch / ao
PyTorch native quantization and sparsity for training and inference
☆2,168 Updated this week
HuaizhengZhang / AI-Infra-from-Zero-to-Hero
🚀 Awesome System for Machine Learning ⚡️ AI System Papers and Industry Practice. ⚡️ System for Machine Learning, LLM (Large Language Mod…
☆3,081 Updated last month
Liu-xiandong / How_to_optimize_in_GPU
This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several…
☆1,088 Updated last year
pytorch-labs / gpt-fast
Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python.
☆6,011 Updated 3 months ago
linkedin / Liger-Kernel
Efficient Triton Kernels for LLM Training
☆5,338 Updated this week
mit-han-lab / llm-awq
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
☆3,140 Updated this week
srush / LLM-Training-Puzzles
What would you do with 1000 H100s...
☆1,061 Updated last year