A resilient distributed training framework
☆97Apr 11, 2024Updated last year
Alternatives and similar repositories for Oobleck
Users that are interested in Oobleck are comparing it to the libraries listed below
- ☆26Aug 31, 2023Updated 2 years ago
- A Cluster-Wide Model Manager to Accelerate DNN Training via Automated Training Warmup☆35Jan 9, 2023Updated 3 years ago
- Bamboo is a system for running large pipeline-parallel DNNs affordably, reliably, and efficiently using spot instances.☆55Dec 11, 2022Updated 3 years ago
- ☆24Aug 15, 2023Updated 2 years ago
- LLM checkpointing for DeepSpeed/Megatron☆25Nov 30, 2025Updated 3 months ago
- [ACM EuroSys 2023] Fast and Efficient Model Serving Using Multi-GPUs with Direct-Host-Access☆56Aug 6, 2025Updated 6 months ago
- ☆22Apr 22, 2024Updated last year
- Releasing the spot availability traces used in the "Can't Be Late" paper.☆24Mar 31, 2024Updated last year
- ☆251Jul 25, 2024Updated last year
- Measure and optimize the energy consumption of your AI applications!☆336Updated this week
- An Attention Superoptimizer☆22Jan 20, 2025Updated last year
- A Generic Resource-Aware Hyperparameter Tuning Execution Engine☆15Jan 8, 2022Updated 4 years ago
- ☆44Jul 4, 2024Updated last year
- ☆82Feb 11, 2026Updated 2 weeks ago
- Official repository for the paper DynaPipe: Optimizing Multi-task Training through Dynamic Pipelines☆19Dec 8, 2023Updated 2 years ago
- ☆198Aug 31, 2019Updated 6 years ago
- A Framework for Automated Validation of Deep Learning Training Tasks☆62Feb 13, 2026Updated 2 weeks ago
- ☆19May 4, 2023Updated 2 years ago
- [ICML 2024] Serving LLMs on heterogeneous decentralized clusters.☆34May 6, 2024Updated last year
- Dynamic resource changes for multi-dimensional parallelism training☆30Aug 22, 2025Updated 6 months ago
- Advanced Topics on Systems for X☆282Jul 10, 2024Updated last year
- Code for "Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads", which appeared at OSDI 2020☆137Jul 25, 2024Updated last year
- Official repository for "IPDPS '24 QSync: Quantization-Minimized Synchronous Distributed Training Across Hybrid Devices".☆20Feb 23, 2024Updated 2 years ago
- Metis: Learning to Schedule Long-Running Applications in Shared Container Clusters at Scale☆19May 27, 2020Updated 5 years ago
- Deferred Continuous Batching in Resource-Efficient Large Language Model Serving (EuroMLSys 2024)☆19May 28, 2024Updated last year
- ☆56Jan 25, 2021Updated 5 years ago
- ☆11Jul 9, 2023Updated 2 years ago
- Helios Traces from SenseTime☆61Sep 27, 2022Updated 3 years ago
- Fine-grained GPU sharing primitives☆148Jul 28, 2025Updated 7 months ago
- Oort: Efficient Federated Learning via Guided Participant Selection☆133Oct 27, 2021Updated 4 years ago
- Zodiac: Unearthing Semantic Checks for Cloud Infrastructure-as-Code Programs, SOSP 2024☆15Nov 28, 2024Updated last year
- RPCNIC: A High-Performance and Reconfigurable PCIe-attached RPC Accelerator [HPCA2025]☆13Dec 9, 2024Updated last year
- Official Repo for "SplitQuant / LLM-PQ: Resource-Efficient LLM Offline Serving on Heterogeneous GPUs via Phase-Aware Model Partition and …☆36Aug 29, 2025Updated 6 months ago
- ☆52Dec 13, 2022Updated 3 years ago
- Federated Learning Systems Paper List☆75Feb 7, 2024Updated 2 years ago
- An interference-aware scheduler for fine-grained GPU sharing☆159Nov 26, 2025Updated 3 months ago
- An experimental parallel training platform☆56Mar 25, 2024Updated last year
- A performant, memory-efficient checkpointing library for PyTorch applications, designed with large, complex distributed workloads in mind…☆164Jan 12, 2026Updated last month
- ☆48Jul 13, 2024Updated last year