lsds/KungFu

Fast and Adaptive Distributed Machine Learning for TensorFlow, PyTorch and MindSpore.

☆295

Alternatives and similar repositories for KungFu

Users interested in KungFu are comparing it to the libraries listed below.

alibaba / GPU-scheduler-for-deep-learning
GPU-scheduler-for-deep-learning
☆214 · Nov 5, 2020 · Updated 5 years ago
SymbioticLab / Salus
Fine-grained GPU sharing primitives
☆149 · Jul 28, 2025 · Updated 11 months ago
lsds / Crossbow
Crossbow: A Multi-GPU Deep Learning System for Training with Small Batch Sizes
☆57 · Oct 5, 2022 · Updated 3 years ago
petuum / autodist
Simple Distributed Deep Learning on TensorFlow
☆136 · Feb 5, 2026 · Updated 5 months ago
petuum / adaptdl
Resource-adaptive cluster scheduler for deep learning training.
☆459 · Mar 5, 2023 · Updated 3 years ago
stanford-futuredata / gavel
Code for "Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads", which appeared at OSDI 2020
☆139 · Jul 25, 2024 · Updated last year
SymbioticLab / Tiresias
Tiresias is a GPU cluster manager for distributed deep learning training.
☆166 · May 7, 2020 · Updated 6 years ago
uw-mad-dash / Accordion
Code for reproducing experiments performed for Accordion
☆13 · Jun 11, 2021 · Updated 5 years ago
microsoft / hivedscheduler
Kubernetes Scheduler for Deep Learning
☆263 · May 22, 2022 · Updated 4 years ago
netx-repo / PipeSwitch
PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications
☆127 · May 9, 2022 · Updated 4 years ago
netx-repo / training-bottleneck
Analyze network performance in distributed training
☆20 · Oct 20, 2020 · Updated 5 years ago
Funatiq / gossip
gossip: Efficient Communication Primitives for Multi-GPU Systems
☆62 · Jul 1, 2022 · Updated 4 years ago
jiazhihao / attention_superoptimizer
An Attention Superoptimizer
☆22 · Jan 20, 2025 · Updated last year
msr-fiddle / philly-traces
☆199 · Aug 31, 2019 · Updated 6 years ago
anandj91 / p3
☆21 · Nov 29, 2022 · Updated 3 years ago
bytedance / byteps
A high performance and generic framework for distributed DNN training
☆3,717 · Oct 3, 2023 · Updated 2 years ago
msr-fiddle / pipedream
☆394 · Nov 4, 2022 · Updated 3 years ago
quiver-team / torch-quiver
PyTorch Library for Low-Latency, High-Throughput Graph Learning on GPUs.
☆304 · Aug 17, 2023 · Updated 2 years ago
marius-team / marius
Large scale graph learning on a single machine.
☆167 · Feb 25, 2025 · Updated last year
zhuangwang93 / Espresso
Hi-Speed DNN Training with Espresso: Unleashing the Full Potential of Gradient Compression with Near-Optimal Usage Strategies (EuroSys '2…
☆15 · Sep 21, 2023 · Updated 2 years ago
thu-pacman / PET
PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections
☆126 · Jun 23, 2022 · Updated 4 years ago
HKBU-HPML / ddl-benchmarks
ddl-benchmarks: Benchmarks for Distributed Deep Learning
☆36 · May 29, 2020 · Updated 6 years ago
saareliad / FTPipe
FTPipe and related pipeline model parallelism research.
☆44 · May 16, 2023 · Updated 3 years ago
sands-lab / grace
GRACE - GRAdient ComprEssion for distributed deep learning
☆141 · Jul 23, 2024 · Updated last year
kleveross / ftlib
Fault-tolerant for DL frameworks
☆71 · Jul 5, 2023 · Updated 3 years ago
alibaba / EasyParallelLibrary
Easy Parallel Library (EPL) is a general and efficient deep learning framework for distributed model training.
☆272 · Mar 31, 2023 · Updated 3 years ago
microsoft / nnfusion
A flexible and efficient deep neural network (DNN) compiler that generates high-performance executable from a DNN model description.
☆1,002 · Sep 19, 2024 · Updated last year
linnanwang / superneurons-release
this is the release repository of superneurons
☆54 · Feb 13, 2021 · Updated 5 years ago
snuspl / nimble
Lightweight and Parallel Deep Learning Framework
☆263 · Nov 26, 2022 · Updated 3 years ago
mcanini / SysML-reading-list
Systems for ML/AI & ML/AI for Systems paper reading list: A curated reading list of computer science research for work at the intersectio…
☆286 · Jun 9, 2025 · Updated last year
flexflow / flexflow-train
Automatically Discovering Fast Parallelization Strategies for Distributed Deep Neural Network Training
☆1,896 · Jul 1, 2026 · Updated 2 weeks ago
zhuohan123 / terapipe
☆79 · May 4, 2021 · Updated 5 years ago
kzhang28 / Optimus
An Efficient Dynamic Resource Scheduler for Deep Learning Clusters
☆41 · Oct 28, 2017 · Updated 8 years ago
uw-mad-dash / shockwave
Artifact for "Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning" [NSDI '23]
☆46 · Nov 24, 2022 · Updated 3 years ago
facebookresearch / stochastic_gradient_push
Stochastic Gradient Push for Distributed Deep Learning
☆172 · Apr 5, 2023 · Updated 3 years ago
alexrenz / AdaPM
A fully adaptive, zero-tuning parameter manager that enables efficient distributed machine learning training
☆21 · Feb 23, 2023 · Updated 3 years ago
alpa-projects / alpa
Training and serving large-scale neural networks with auto parallelization.
☆3,178 · Dec 9, 2023 · Updated 2 years ago
parasj / checkmate
Training neural networks in TensorFlow 2.0 with 5x less memory
☆137 · Feb 21, 2022 · Updated 4 years ago
sands-lab / omnireduce
☆69 · Mar 14, 2023 · Updated 3 years ago