huggingface / gpu-fryer
Where GPUs get cooked 👩‍🍳🔥
★345 · Updated 3 months ago
Alternatives and similar repositories for gpu-fryer
Users interested in gpu-fryer are comparing it to the libraries listed below.
- Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo) ★465 · Updated last week
- 👷 Build compute kernels ★198 · Updated 2 weeks ago
- Load compute kernels from the Hub (usage sketch after this list) ★357 · Updated 3 weeks ago
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ★195 · Updated 7 months ago
- Inference server benchmarking tool ★135 · Updated 3 months ago
- FlexAttention-based, minimal vLLM-style inference engine for fast Gemma 2 inference (sketch after this list). ★329 · Updated 2 months ago
- MoE training for Me and You and maybe other people ★315 · Updated this week
- ★225 · Updated last month
- PyTorch Single Controller ★939 · Updated this week
- Write a fast kernel and run it on Discord. See how you compare against the best! ★66 · Updated 2 weeks ago
- Simple MPI implementation for prototyping or learning ★297 · Updated 5 months ago
- ★178 · Updated last year
- Scalable and Performant Data Loading ★360 · Updated last week
- Best practices & guides on how to write distributed PyTorch training code (sketch after this list) ★562 · Updated 2 months ago
- ★91 · Updated last year
- Dion optimizer algorithm ★413 · Updated this week
- A tool to configure, launch and manage your machine learning experiments ★212 · Updated last week
- ★219 · Updated 11 months ago
- 🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of O… ★325 · Updated 3 months ago
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ★277 · Updated last month
- ★275 · Updated this week
- Slides, notes, and materials for the workshop ★337 · Updated last year
- Small-scale distributed training of sequential deep learning models, built on NumPy and MPI. ★154 · Updated 2 years ago
- Simple & Scalable Pretraining for Neural Architecture Research ★306 · Updated last month
- ★214 · Updated 2 weeks ago
- PCCL (Prime Collective Communications Library) implements fault-tolerant collective communications over IP ★141 · Updated 3 months ago
- Google TPU optimizations for transformers models ★133 · Updated 2 weeks ago
- JetStream is a throughput and memory optimized engine for LLM inference on XLA devices, starting with TPUs (and GPUs in future -- PRs wel… ★398 · Updated 6 months ago
- Quantized LLM training in pure CUDA/C++. ★230 · Updated this week
- Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024 ★351 · Updated 8 months ago
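Three short sketches for the entries flagged above follow; they illustrate the general techniques, not the repositories' internals.

For the "Load compute kernels from the Hub" entry: a minimal sketch of the `kernels` package's documented pattern, assuming the `kernels-community/activation` Hub repository and its `gelu_fast(out, x)` entry point as shown in that project's README; treat those names as assumptions.

```python
import torch
from kernels import get_kernel

# Fetch a precompiled kernel from the Hugging Face Hub (assumed repo name).
activation = get_kernel("kernels-community/activation")

x = torch.randn(8, 1024, dtype=torch.float16, device="cuda")
out = torch.empty_like(x)
activation.gelu_fast(out, x)  # fused GELU, written into `out` in place
```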
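For the FlexAttention entry: FlexAttention (PyTorch ≥ 2.5) expresses attention variants as a `score_mod` callback over attention logits. A minimal sketch of causal attention with Gemma 2-style logit soft-capping; the cap value and tensor shapes are illustrative and not taken from the engine listed above.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

B, H, S, D = 1, 4, 128, 64  # illustrative shapes
q, k, v = (torch.randn(B, H, S, D, device="cuda") for _ in range(3))

def softcap_causal(score, b, h, q_idx, kv_idx):
    score = 50.0 * torch.tanh(score / 50.0)  # soft-cap logits (cap assumed)
    return torch.where(q_idx >= kv_idx, score, -float("inf"))  # causal mask

out = flex_attention(q, k, v, score_mod=softcap_causal)  # (B, H, S, D)
```

In practice you would wrap `flex_attention` in `torch.compile` and express the mask as a `block_mask` so fully masked blocks are skipped rather than computed.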
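For the distributed-training guides entry: the baseline pattern such best practices build on is DistributedDataParallel launched with torchrun. A generic sketch using only stock PyTorch APIs; the model and loss are placeholders.

```python
# Launch with: torchrun --nproc_per_node=8 train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")  # torchrun sets RANK/WORLD_SIZE
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(1024, 1024).cuda(), device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        x = torch.randn(32, 1024, device="cuda")  # placeholder batch
        loss = model(x).square().mean()           # placeholder objective
        opt.zero_grad()
        loss.backward()  # DDP all-reduces gradients across ranks here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```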