zhuangwang93 / Espresso

Hi-Speed DNN Training with Espresso: Unleashing the Full Potential of Gradient Compression with Near-Optimal Usage Strategies (EuroSys '23)

☆15

Alternatives and similar repositories for Espresso

Users who are interested in Espresso are comparing it to the repositories listed below.

ParCIS / Ok-Topk
Ok-Topk is a scheme for distributed training with sparse gradients. Ok-Topk integrates a novel sparse allreduce algorithm (less than 6k c…
☆27 Updated 2 years ago
SymbioticLab / ModelKeeper
A Cluster-Wide Model Manager to Accelerate DNN Training via Automated Training Warmup
☆35 Updated 2 years ago
Rivendile / Muri
Artifacts for our SIGCOMM'22 paper Muri
☆43 Updated last year
msr-fiddle / synergy
☆51 Updated 2 years ago
phoenix-dataplane / mCCS
Managed collective communication service
☆22 Updated last year
uw-mad-dash / shockwave
Artifact for "Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning" [NSDI '23]
☆45 Updated 2 years ago
SophiaLi06 / BytePS_THC
THC: Accelerating Distributed Deep Learning Using Tensor Homomorphic Compression
☆19 Updated last year
microsoft / taccl
TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches
☆76 Updated 2 years ago
JF-D / Proteus
☆23 Updated last year
microsoft / TE-CCL
☆41 Updated last year
sands-lab / omnireduce
☆68 Updated 2 years ago
alibaba / alibaba-lingjun-dataset-2023
☆58 Updated last year
casys-kaist / EnvPipe
☆25 Updated 2 years ago
alpa-projects / mms
AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving (OSDI 23)
☆88 Updated 2 years ago
msr-fiddle / blox
☆43 Updated last year
suquark / hoplite
☆44 Updated 4 years ago
SymbioticLab / Aequitas
Aequitas enables RPC-level QoS in datacenter networks.
☆18 Updated 3 years ago
pkusys / ElasticFlow
Artifacts for our ASPLOS'23 paper ElasticFlow
☆54 Updated last year
msr-fiddle / CheckFreq
☆56 Updated 4 years ago
netx-repo / training-bottleneck
Analyze network performance in distributed training
☆19 Updated 5 years ago
gudiandian / ElasticFlow
☆16 Updated last year
uclasystem / bamboo
Bamboo is a system for running large pipeline-parallel DNNs affordably, reliably, and efficiently using spot instances.
☆51 Updated 2 years ago
Thesys-lab / Helix-ASPLOS25
Open-source implementation for "Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow"
☆68 Updated last week
Raphael-Hao / Abacus
☆38 Updated 3 months ago
netx-repo / PipeSwitch
PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications
☆126 Updated 3 years ago
netiken / m4
[TBD] "m4: A Learned Flow-level Network Simulator" by Chenning Li, Anton A. Zabreyko, Om Chabra, Arash Nasr-Esfahany, Kevin Zhao, Pratees…
☆15 Updated last week
hipersys-team / TopoOpt
[NSDI 2023] TopoOpt: Optimizing the Network Topology for Distributed DNN Training
☆34 Updated last year
UMass-LIDS / Proteus
Proteus: A High-Throughput Inference-Serving System with Accuracy Scaling
☆13 Updated last year
mlcommons / chakra-old
Repository for MLCommons Chakra schema and tools
☆39 Updated last year
LLMServe / dLoRA-artifact
☆27 Updated last year