cuda-mode / lectures

Material for cuda-mode lectures

☆2,401

Related projects:

cuda-mode / resource-stream
CUDA related news and material links
☆1,079 Updated 2 weeks ago
srush / Triton-Puzzles
Puzzles for learning Triton
☆966 Updated this week
HazyResearch / ThunderKittens
Tile primitives for speedy kernels
☆1,489 Updated this week
DefTruth / Awesome-LLM-Inference
📖A curated list of Awesome LLM Inference Paper with codes, TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batc…
☆2,475 Updated this week
cuda-mode / awesomeMLSys
An ML Systems Onboarding list
☆491 Updated last month
DefTruth / CUDA-Learn-Notes
🎉CUDA/C++ notes / technical blog: fp32, fp16/bf16, fp8/int8, flash_attn, sgemm, sgemv, warp/block reduce, dot prod, elementwise, softmax, layernorm, rmsnorm, his…
☆1,140 Updated this week
mit-han-lab / llm-awq
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
☆2,333 Updated 2 months ago
sustcsonglin / flash-linear-attention
Efficient implementations of state-of-the-art linear attention models in Pytorch and Triton
☆1,190 Updated this week
NVIDIA / TransformerEngine
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs…
☆1,811 Updated this week
BBuf / how-to-optim-algorithm-in-cuda
How to optimize some algorithms in CUDA.
☆1,443 Updated this week
tspeterkim / flash-attention-minimal
Flash Attention in ~100 lines of CUDA (forward pass only)
☆558 Updated 5 months ago
clu0 / unet.cu
UNet diffusion model in pure CUDA
☆562 Updated 2 months ago
facebookresearch / schedule_free
Schedule-Free Optimization in PyTorch
☆1,800 Updated last month
minitorch / minitorch
The full minitorch student suite.
☆1,849 Updated last month
flashinfer-ai / flashinfer
FlashInfer: Kernel Library for LLM Serving
☆1,138 Updated this week
karpathy / nano-llama31
nanoGPT style version of Llama 3.1
☆1,162 Updated last month
NVIDIA / cutlass
CUDA Templates for Linear Algebra Subroutines
☆5,359 Updated this week
olcf / cuda-training-series
Training materials associated with NVIDIA's CUDA Training Series (www.olcf.ornl.gov/cuda-training-series/)
☆541 Updated last month
pytorch / ao
PyTorch native quantization and sparsity for training and inference
☆726 Updated this week
srush / Tensor-Puzzles
Solve puzzles. Improve your pytorch.
☆3,056 Updated 2 months ago
PacktPublishing / Learn-CUDA-Programming
Learn CUDA Programming, published by Packt
☆987 Updated 8 months ago
pytorch / torchtitan
A native PyTorch Library for large model training
☆1,544 Updated this week
horseee / Awesome-Efficient-LLM
A curated list for Efficient Large Language Models
☆1,113 Updated this week
NVIDIA / FasterTransformer
Transformer related optimization, including BERT, GPT
☆5,773 Updated 5 months ago
srush / LLM-Training-Puzzles
What would you do with 1000 H100s...
☆816 Updated 8 months ago
HuangOwen / Awesome-LLM-Compression
Awesome LLM compression research papers and tools.
☆1,054 Updated this week
Lightning-AI / lightning-thunder
Make PyTorch models up to 40% faster! Thunder is a source to source compiler for PyTorch. It enables using different hardware executors a…
☆1,131 Updated this week
mit-han-lab / smoothquant
[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
☆1,180 Updated 2 months ago
EleutherAI / cookbook
Deep learning for dummies. All the practical details and useful utilities that go into working with real models.
☆662 Updated last month
ModelTC / lightllm
LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalabili…
☆2,292 Updated this week