ByteDance-Seed/StragglerAnalysis

☆56

Alternatives and similar repositories for StragglerAnalysis

Users who are interested in StragglerAnalysis are comparing it to the libraries listed below.

eunomia-bpf / nccl-eBPF
☆20 · Jul 7, 2026 · Updated last week
SophiaLi06 / BytePS_THC
THC: Accelerating Distributed Deep Learning Using Tensor Homomorphic Compression
☆20 · Jul 30, 2024 · Updated last year
ByteDance-Seed / ByteCheckpoint
ByteCheckpoint: An Unified Checkpointing Library for LFMs
☆286 · Feb 2, 2026 · Updated 5 months ago
astra-sim / stage
STAGE: A Symbolic Tensor grAph GEnerator for distributed AI system co-design
☆48 · Jun 27, 2026 · Updated 3 weeks ago
JF-D / Parcae
☆22 · Apr 22, 2024 · Updated 2 years ago
spcl / atlahs
ATLAHS: An Application-centric Network Simulator Toolchain for AI, HPC, and Distributed Storage
☆94 · May 12, 2026 · Updated 2 months ago
volcengine / veScale
Byted PyTorch Distributed for Hyperscale Training of LLMs and RLs
☆1,031 · Mar 3, 2026 · Updated 4 months ago
romitjain / kachua-mlsys
[MLSys 26] 🥇 Solution for Gated Delta Net Track of MLSys 26 Flash infer competition
☆35 · May 22, 2026 · Updated last month
raywan-110 / AdaQP
Adaptive Message Quantization and Parallelization for Distributed Full-graph GNN Training
☆24 · Mar 1, 2024 · Updated 2 years ago
chenyu-jiang / dcp
Code repository for the SOSP'25 paper DCP: Addressing Input Dynamism In Long-Context Training via Dynamic Context Parallelism.
☆21 · Nov 28, 2025 · Updated 7 months ago
FlexFusion / FlexFusion
The official implementation for the intra-stage fusion technique introduced in https://arxiv.org/abs/2409.13221
☆31 · Apr 22, 2025 · Updated last year
TransferQueue / TransferQueue
[Archived] For the latest updates and community contribution, please visit: https://github.com/Ascend/TransferQueue or https://gitcode.co…
☆16 · Jan 16, 2026 · Updated 6 months ago
verl-project / vexact
verl Zero-Mismatch Dense/MoE HuggingFace Rollout
☆61 · Updated this week
eth-easl / sailor
AI model training on heterogeneous, geo-distributed resources
☆46 · Nov 24, 2025 · Updated 7 months ago
kungfu-team / tenplex
Dynamic resources changes for multi-dimensional parallelism training
☆31 · Aug 22, 2025 · Updated 10 months ago
ConnollyLeon / awesome-Auto-Parallelism
A baseline repository of Auto-Parallelism in Training Neural Networks
☆145 · Jun 25, 2022 · Updated 4 years ago
mlc-ai / pith-train
Compact and Agent-Native MoE Training System
☆288 · Updated this week
inclusionAI / asystem-amem
A NCCL extension library, designed to efficiently offload GPU memory allocated by the NCCL communication library.
☆110 · Dec 17, 2025 · Updated 7 months ago
radixark / miles_diffusion
[Experimental] Miles-diffusion is an post-training framework for large-scale diffusion model training and production workloads, forked fr…
☆21 · Updated this week
infinigence / HamiltonAttention
☆45 · Oct 15, 2025 · Updated 9 months ago
suquark / hoplite
☆43 · Sep 6, 2021 · Updated 4 years ago
microsoft / nnscaler
nnScaler: Compiling DNN models for Parallel Training
☆135 · Jul 2, 2026 · Updated 2 weeks ago
foundation-model-stack / vllm-triton-backend
A Triton-only attention backend for vLLM
☆27 · Updated this week
CentML / lorafusion
LoRAFusion: Efficient LoRA Fine-Tuning for LLMs
☆28 · Jul 2, 2026 · Updated 2 weeks ago
SALT-NLP / search_privacy_risk
Code for the paper "Searching Privacy Risks in Multi-Agent Systems via Simulation"
☆24 · Oct 13, 2025 · Updated 9 months ago
msr-fiddle / blox
☆46 · Jul 4, 2024 · Updated 2 years ago
MLSysOps / InfraGym
Empowering LLM Agents for Real-World Computer System Optimization
☆17 · Sep 10, 2025 · Updated 10 months ago
aliyun / syccl
☆24 · Sep 10, 2025 · Updated 10 months ago
sgl-project / DeepGEMM
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
☆32 · Updated this week
eunomia-bpf / cupti-tutorial
Tutorials for NVIDIA CUPTI samples
☆70 · Updated this week
byungsoo-oh / ml-systems-papers
Curated collection of papers in machine learning systems
☆632 · Feb 7, 2026 · Updated 5 months ago
pytorch-fdn / foundation-hosted
This repository defines the intake process for open-source projects seeking to become officially hosted under the PyTorch Foundation, an …
☆19 · Feb 2, 2026 · Updated 5 months ago
HydraQYH / hp_rms_norm
High performance RMSNorm Implement by using SM Core Storage(Registers and Shared Memory)
☆30 · Jan 22, 2026 · Updated 5 months ago
liu445126256 / FuncPipe
☆11 · Jul 9, 2023 · Updated 3 years ago
astra-sim / tacos
TACOS: [T]opology-[A]ware [Co]llective Algorithm [S]ynthesizer for Distributed Machine Learning
☆37 · Jun 13, 2025 · Updated last year
meta-pytorch / remat
torch_remat fine-grained activation checkpointing API
☆15 · Updated this week
bytedance / flux
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
☆1,343 · Aug 28, 2025 · Updated 10 months ago
stepfun-ai / StepMesh
☆377 · Jan 28, 2026 · Updated 5 months ago
hangxu0304 / DeepReduce
A Sparse-tensor Communication Framework for Distributed Deep Learning
☆13 · Nov 1, 2021 · Updated 4 years ago