ailzhang / minPP
Pipeline parallelism for the minimalist
☆33Updated 3 weeks ago
Alternatives and similar repositories for minPP
Users who are interested in minPP are comparing it to the libraries listed below.
- [WIP] Better (FP8) attention for Hopper☆32Updated 6 months ago
- ring-attention experiments☆150Updated 10 months ago
- ☆163Updated last year
- 👷 Build compute kernels☆119Updated last week
- A collection of reproducible inference engine benchmarks☆32Updated 4 months ago
- ☆88Updated last year
- Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8.☆46Updated last year
- extensible collectives library in triton☆88Updated 5 months ago
- ☆92Updated 3 months ago
- Load compute kernels from the Hub☆258Updated last week
- Write a fast kernel and run it on Discord. See how you compare against the best!☆51Updated last week
- Make triton easier☆47Updated last year
- CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning☆182Updated 3 weeks ago
- ☆74Updated 5 months ago
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS☆214Updated 3 months ago
- ☆234Updated 2 weeks ago
- A bunch of kernels that might make stuff slower 😉☆58Updated 2 weeks ago
- Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo)☆392Updated this week
- JaxPP is a library for JAX that enables flexible MPMD pipeline parallelism for large-scale LLM training☆52Updated 3 weeks ago
- PTX-Tutorial Written Purely By AIs (Deep Research of Openai and Claude 3.7)☆66Updated 5 months ago
- FlexAttention w/ FlashAttention3 Support☆27Updated 11 months ago
- This repository contains the experimental PyTorch native float8 training UX☆224Updated last year
- Fast low-bit matmul kernels in Triton☆356Updated last week
- A block oriented training approach for inference time optimization.☆34Updated last year
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand☆191Updated 3 months ago
- Experiment of using Tangent to autodiff triton☆80Updated last year
- A performant, memory-efficient checkpointing library for PyTorch applications, designed with large, complex distributed workloads in mind…☆159Updated 2 months ago
- Applied AI experiments and examples for PyTorch☆292Updated last week
- ☆85Updated last week
- ☆75Updated 8 months ago