sail-sg / zero-bubble-pipeline-parallelism
Zero Bubble Pipeline Parallelism
☆445 · Updated 8 months ago
Alternatives and similar repositories for zero-bubble-pipeline-parallelism
Users interested in zero-bubble-pipeline-parallelism are comparing it to the libraries listed below.
- PyTorch bindings for CUTLASS grouped GEMM. ☆180 · Updated 3 weeks ago
- Allows torch tensor memory to be released and resumed later. ☆199 · Updated last month
- A collection of memory-efficient attention operators implemented in the Triton language. ☆287 · Updated last year
- Dynamic Memory Management for Serving LLMs without PagedAttention. ☆454 · Updated 7 months ago
- LLM training technologies developed by Kwai. ☆69 · Updated last week
- Train speculative decoding models effortlessly and port them smoothly to SGLang serving. ☆626 · Updated this week
- nnScaler: Compiling DNN Models for Parallel Training. ☆123 · Updated 3 months ago
- USP: Unified (a.k.a. Hybrid, 2D) Sequence-Parallel Attention for Long-Context Transformer Training and Inference. ☆621 · Updated 3 weeks ago
- Pipeline Parallelism Emulation and Visualization. ☆74 · Updated last week
- Best practices for training DeepSeek, Mixtral, Qwen, and other MoE models using Megatron Core. ☆149 · Updated 3 weeks ago
- ☆338 · Updated last week
- Latency and Memory Analysis of Transformer Models for Training and Inference. ☆475 · Updated 8 months ago
- A low-latency & high-throughput serving engine for LLMs. ☆464 · Updated last week
- Perplexity GPU Kernels. ☆552 · Updated 2 months ago
- ☆154 · Updated last year
- Accelerating MoE with IO- and Tile-Aware Optimizations. ☆522 · Updated last week
- Disaggregated serving system for Large Language Models (LLMs). ☆761 · Updated 9 months ago
- Official repository for DistFlashAttn: Distributed Memory-Efficient Attention for Long-Context LLM Training. ☆220 · Updated last year
- Microsoft Automatic Mixed Precision Library. ☆635 · Updated last month
- Ring attention implementation with flash attention. ☆961 · Updated 4 months ago
- Byted PyTorch Distributed for Hyperscale Training of LLMs and RL. ☆917 · Updated last month
- Distributed Compiler based on Triton for Parallel Systems. ☆1,313 · Updated 2 weeks ago
- PyTorch bindings for CUTLASS grouped GEMM. ☆140 · Updated 7 months ago
- ByteCheckpoint: A Unified Checkpointing Library for LFMs. ☆258 · Updated last month
- Materials for learning SGLang. ☆714 · Updated last week
- Applied AI experiments and examples for PyTorch. ☆312 · Updated 4 months ago
- Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS. ☆473 · Updated 8 months ago
- ☆153 · Updated 10 months ago
- A baseline repository of Auto-Parallelism in Training Neural Networks. ☆147 · Updated 3 years ago
- ☆255 · Updated last year