SymbioticLab/ModelKeeper

A Cluster-Wide Model Manager to Accelerate DNN Training via Automated Training Warmup

☆36

Alternatives and similar repositories for ModelKeeper

Users that are interested in ModelKeeper are comparing it to the libraries listed below.

SymbioticLab / Fluid
A Generic Resource-Aware Hyperparameter Tuning Execution Engine
☆15 · Jan 8, 2022 · Updated 4 years ago
SymbioticLab / Salus
Fine-grained GPU sharing primitives
☆149 · Jul 28, 2025 · Updated 11 months ago
SymbioticLab / Tiresias
Tiresias is a GPU cluster manager for distributed deep learning training.
☆166 · May 7, 2020 · Updated 6 years ago
dywsjtu / apparate
Artifact for "Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving" [SOSP '24]
☆24 · Nov 21, 2024 · Updated last year
SymbioticLab / Oobleck
A resilient distributed training framework
☆99 · Apr 11, 2024 · Updated 2 years ago
uw-mad-dash / Accordion
Code for reproducing experiments performed for Accordion
☆13 · Jun 11, 2021 · Updated 5 years ago
uw-mad-dash / shockwave
Artifact for "Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning" [NSDI '23]
☆46 · Nov 24, 2022 · Updated 3 years ago
stanford-futuredata / gavel
Code for "Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads", which appeared at OSDI 2020
☆139 · Jul 25, 2024 · Updated last year
msr-fiddle / philly-traces
☆199 · Aug 31, 2019 · Updated 6 years ago
AmberLJC / FLsystem-paper
Federated Learning Systems Paper List
☆75 · Feb 7, 2024 · Updated 2 years ago
SymbioticLab / Justitia
Justitia provides RDMA isolation between applications with diverse requirements.
☆43 · May 25, 2022 · Updated 4 years ago
S-Lab-System-Group / ChronusArtifact
☆23 · Jan 7, 2022 · Updated 4 years ago
Rivendile / Muri
Artifacts for our SIGCOMM'22 paper Muri
☆44 · Dec 29, 2023 · Updated 2 years ago
casys-kaist / EnvPipe
☆27 · Aug 31, 2023 · Updated 2 years ago
gudiandian / ElasticFlow
☆17 · May 10, 2024 · Updated 2 years ago
pengyanghua / optimus
A Deep Learning Cluster Scheduler
☆36 · Jan 11, 2021 · Updated 5 years ago
S-Lab-System-Group / Primo
Primo: Practical Learning-Augmented Systems with Interpretable Models
☆19 · Dec 26, 2023 · Updated 2 years ago
SymbioticLab / Sol
A Federated Execution Engine for Fast Distributed Computation Over Slow Networks
☆25 · Apr 26, 2021 · Updated 5 years ago
netx-repo / training-bottleneck
Analyze network performance in distributed training
☆20 · Oct 20, 2020 · Updated 5 years ago
zhuzilin / pytorch-malloc
An external memory allocator example for PyTorch.
☆16 · Aug 10, 2025 · Updated 11 months ago
lwangbm / Metis
Metis: Learning to Schedule Long-Running Applications in Shared Container Clusters at Scale
☆19 · May 27, 2020 · Updated 6 years ago
msr-fiddle / synergy
☆54 · Dec 13, 2022 · Updated 3 years ago
S-Lab-System-Group / Hydro
Surrogate-based Hyperparameter Tuning System
☆30 · Jun 29, 2023 · Updated 3 years ago
DiT-Serving / TetriServe
[ASPLOS '26] TetriServe: Efficiently Serving Mixed DiT Workloads
☆17 · Mar 12, 2026 · Updated 4 months ago
PKU-Chengxu / FLASH
☆48 · Jun 2, 2022 · Updated 4 years ago
ruipeterpan / paper_notes
Personal blog + reading notes on system-ish papers
☆17 · Oct 29, 2023 · Updated 2 years ago
hku-systems / naspipe
☆14 · Jan 12, 2022 · Updated 4 years ago
reconfigurable-ml-pipeline / ipa
Source code of IPA, https://escholarship.org/uc/item/2p0805dq
☆12 · Jun 27, 2024 · Updated 2 years ago
HuaizhengZhang / MIGProfiler
Multi-Instance-GPU profiling tool
☆58 · Apr 16, 2023 · Updated 3 years ago
columbia / PrivateKube
Privacy Budget Orchestration in Machine Learning Workloads (OSDI '21)
☆27 · Oct 20, 2023 · Updated 2 years ago
ruipeterpan / torch_profiler
Simple PyTorch profiler that combines DeepSpeed Flops Profiler and TorchInfo
☆12 · Feb 12, 2023 · Updated 3 years ago
microsoft / elasticflow-traces
Integrated Training Platform (ITP) traces used in ElasticFlow paper.
☆31 · Dec 23, 2022 · Updated 3 years ago
SymbioticLab / Oort
Oort: Efficient Federated Learning via Guided Participant Selection
☆138 · Oct 27, 2021 · Updated 4 years ago
ml-energy / zeus
Measure and optimize the energy consumption of your AI applications!
☆368 · Jul 7, 2026 · Updated last week
hiddenlayer2020 / ML-Job-Scheduler-MLFS
☆13 · Dec 18, 2020 · Updated 5 years ago
ucbrise / hypersched
Deadline-based hyperparameter tuning on RayTune.
☆32 · Jan 16, 2020 · Updated 6 years ago
xinjin / course-net-seminar
Selected Topics in Computer Networks @ Johns Hopkins University
☆19 · Dec 17, 2020 · Updated 5 years ago
gpu2grid / openg2g
Modular simulation library for AI datacenter-grid interaction
☆15 · May 11, 2026 · Updated 2 months ago
pkusys / ElasticFlow
Artifacts for our ASPLOS'23 paper ElasticFlow
☆56 · May 10, 2024 · Updated 2 years ago