huggingface / gpu-fryer
Where GPUs get cooked 👩‍🍳🔥
⭐294 · Updated last month
Alternatives and similar repositories for gpu-fryer
Users interested in gpu-fryer are comparing it to the repositories listed below
- Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo) ⭐441 · Updated last week
- 👷 Build compute kernels ⭐163 · Updated last week
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ⭐194 · Updated 5 months ago
- Load compute kernels from the Hub ⭐308 · Updated last week
- PyTorch Single Controller ⭐840 · Updated this week
- ⭐174 · Updated last year
- Write a fast kernel and run it on Discord. See how you compare against the best! ⭐58 · Updated 3 weeks ago
- FlexAttention-based, minimal vLLM-style inference engine for fast Gemma 2 inference. ⭐301 · Updated this week
- Simple MPI implementation for prototyping or learning ⭐287 · Updated 2 months ago
- ⭐225 · Updated 2 weeks ago
- Inference server benchmarking tool ⭐121 · Updated last month
- Scalable and Performant Data Loading ⭐330 · Updated this week
- Home for "How To Scale Your Model", a short blog-style textbook about scaling LLMs on TPUs ⭐666 · Updated last week
- Dion optimizer algorithm ⭐374 · Updated last month
- A tool to configure, launch and manage your machine learning experiments. ⭐203 · Updated this week
- ⭐218 · Updated 9 months ago
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ⭐270 · Updated 3 months ago
- PCCL (Prime Collective Communications Library) implements fault-tolerant collective communications over IP ⭐137 · Updated last month
- Best practices & guides on how to write distributed PyTorch training code ⭐526 · Updated last week
- ⭐89 · Updated last year
- PyTorch-native post-training at scale ⭐479 · Updated this week
- Slides, notes, and materials for the workshop ⭐333 · Updated last year
- Small-scale distributed training of sequential deep learning models, built on NumPy and MPI. ⭐147 · Updated 2 years ago
- JetStream is a throughput and memory optimized engine for LLM inference on XLA devices, starting with TPUs (and GPUs in future -- PRs wel… ⭐387 · Updated 4 months ago
- ⭐262 · Updated 2 weeks ago
- Learn CUDA with PyTorch ⭐95 · Updated last month
- PTX-Tutorial Written Purely By AIs (Deep Research of OpenAI and Claude 3.7) ⭐66 · Updated 7 months ago
- ⭐231 · Updated 4 months ago
- Quantized LLM training in pure CUDA/C++. ⭐209 · Updated this week
- A curated list of resources for learning and exploring Triton, OpenAI's programming language for writing efficient GPU code. ⭐425 · Updated 7 months ago