ailzhang / minPP
Pipeline parallelism for the minimalist
☆35 Updated 2 months ago
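minPP's subject, pipeline parallelism, splits a model into sequential stages and feeds microbatches through them in a staggered schedule so stages work concurrently. As a rough illustration only (a pure-Python sketch of a GPipe-style forward schedule, not minPP's actual API — the function name and structure here are hypothetical):

```python
def gpipe_forward_schedule(num_stages: int, num_microbatches: int):
    """Return a list of time steps; each step lists the (stage, microbatch)
    pairs that run concurrently. Stage s processes microbatch m at time
    step s + m, producing the classic pipeline "staircase"."""
    total_steps = num_stages + num_microbatches - 1  # fill + steady + drain
    steps = []
    for t in range(total_steps):
        # At time t, stage s holds microbatch t - s, if that index is valid.
        step = [(s, t - s) for s in range(num_stages)
                if 0 <= t - s < num_microbatches]
        steps.append(step)
    return steps

schedule = gpipe_forward_schedule(num_stages=3, num_microbatches=4)
for t, step in enumerate(schedule):
    print(t, step)
```

With 3 stages and 4 microbatches the pipeline fills over the first two steps, runs all stages in parallel in the middle, then drains; the bubble (idle slots during fill and drain) is what schedules like 1F1B and interleaving try to shrink.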
Alternatives and similar repositories for minPP
Users interested in minPP are comparing it to the libraries listed below.
- JaxPP is a library for JAX that enables flexible MPMD pipeline parallelism for large-scale LLM training ☆54 Updated last week
- [WIP] Better (FP8) attention for Hopper ☆33 Updated 7 months ago
- An extensible collectives library in Triton ☆89 Updated 6 months ago
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆58 Updated 2 weeks ago
- How to ensure correctness and ship LLM-generated kernels in PyTorch ☆66 Updated this week
- PyTorch/XLA integration with JetStream (https://github.com/google/JetStream) for LLM inference ☆74 Updated 3 weeks ago
- This repository contains the experimental PyTorch native float8 training UX ☆223 Updated last year
- Fast low-bit matmul kernels in Triton ☆379 Updated last week
- ☆89 Updated last year
- Load compute kernels from the Hub ☆293 Updated last week
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS ☆230 Updated 5 months ago
- ☆173 Updated last year
- ☆72 Updated 6 months ago
- ring-attention experiments ☆152 Updated 11 months ago
- Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo) ☆414 Updated last week
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆167 Updated this week
- ☆99 Updated 4 months ago
- 🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components. ☆214 Updated this week
- CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning ☆193 Updated 2 months ago
- ☆242 Updated this week
- PTX-Tutorial Written Purely By AIs (Deep Research of OpenAI and Claude 3.7) ☆66 Updated 6 months ago
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ☆194 Updated 4 months ago
- A bunch of kernels that might make stuff slower 😉 ☆59 Updated this week
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆252 Updated this week
- ArcticInference: vLLM plugin for high-throughput, low-latency inference ☆270 Updated this week
- Quantized LLM training in pure CUDA/C++ ☆180 Updated this week
- TritonParse: A Compiler Tracer, Visualizer, and mini-Reproducer for Triton Kernels ☆152 Updated last week
- PyTorch RFCs (experimental) ☆135 Updated 4 months ago
- Boosting 4-bit inference kernels with 2:4 sparsity ☆82 Updated last year
- Small-scale distributed training of sequential deep learning models, built on NumPy and MPI ☆144 Updated last year