huggingface / gpu-fryer
Where GPUs get cooked 👩‍🍳🔥
⭐363 · Updated 3 weeks ago
Alternatives and similar repositories for gpu-fryer
Users interested in gpu-fryer are comparing it to the libraries listed below.
- Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo) ⭐475 · Updated last week
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ⭐198 · Updated 8 months ago
- Load compute kernels from the Hub (see the hedged usage sketch after this list) ⭐397 · Updated this week
- 👷 Build compute kernels ⭐215 · Updated 2 weeks ago
- PyTorch Single Controller ⭐967 · Updated this week
- Inference server benchmarking tool ⭐142 · Updated 4 months ago
- FlexAttention based, minimal vllm-style inference engine for fast Gemma 2 inference. ⭐334 · Updated 3 months ago
- ⭐232 · Updated 2 months ago
- Simple MPI implementation for prototyping or learning ⭐300 · Updated 6 months ago
- Write a fast kernel and run it on Discord. See how you compare against the best! ⭐71 · Updated this week
- Scalable and Performant Data Loading ⭐366 · Updated last week
- ⭐177 · Updated 2 years ago
- Dion optimizer algorithm ⭐431 · Updated 3 weeks ago
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ⭐280 · Updated 2 months ago
- MoE training for Me and You and maybe other people ⭐353 · Updated this week
- A tool to configure, launch and manage your machine learning experiments. ⭐216 · Updated last week
- Best practices & guides on how to write distributed pytorch training code ⭐576 · Updated 3 months ago
- ⭐280 · Updated last week
- ⭐219 · Updated last year
- PCCL (Prime Collective Communications Library) implements fault tolerant collective communications over IP ⭐141 · Updated 5 months ago
- ⭐237 · Updated last month
- ArcticTraining is a framework designed to simplify and accelerate the post-training process for large language models (LLMs) ⭐273 · Updated last week
- ArcticInference: vLLM plugin for high-throughput, low-latency inference ⭐391 · Updated this week
- Google TPU optimizations for transformers models ⭐134 · Updated 2 weeks ago
- JetStream is a throughput and memory optimized engine for LLM inference on XLA devices, starting with TPUs (and GPUs in future -- PRs wel… ⭐404 · Updated last month
- A FlashAttention implementation for JAX with support for efficient document mask computation and context parallelism. ⭐158 · Updated 3 months ago
- A high-throughput and memory-efficient inference and serving engine for LLMs ⭐267 · Updated 2 months ago
- A zero-to-one guide on scaling modern transformers with n-dimensional parallelism. ⭐115 · Updated last month
- TPU inference for vLLM, with unified JAX and PyTorch support. ⭐228 · Updated last week
- Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024 ⭐357 · Updated this week