awslabs / Lancet-Accelerating-MoE-Training-via-Whole-Graph-Computation-Communication-OverlappingLinks

Official implementation for the paper Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping, published in MLSys'24.

☆14

Alternatives and similar repositories for Lancet-Accelerating-MoE-Training-via-Whole-Graph-Computation-Communication-Overlapping

Users that are interested in Lancet-Accelerating-MoE-Training-via-Whole-Graph-Computation-Communication-Overlapping are comparing it to the libraries listed below

Sorting:

parasailteam / coconet
☆81Updated 2 years ago
mutinifni / splitwise-sim
LLM serving cluster simulator
☆108Updated last year
Raphael-Hao / brainstorm
Compiler for Dynamic Neural Networks
☆46Updated last year
HPMLL / NVIDIA-Hopper-Benchmark
☆55Updated 3 months ago
microsoft / nnscaler
nnScaler: Compiling DNN models for Parallel Training
☆118Updated last week
ParCIS / Magicube
Magicube is a high-performance library for quantized sparse matrix operations (SpMM and SDDMM) of deep learning on Tensor Cores.
☆89Updated 2 years ago
pku-liang / MAGIS
MAGIS: Memory Optimization via Coordinated Graph Transformation and Scheduling for DNN (ASPLOS'24)
☆53Updated last year
infinigence / FlashOverlap
A lightweight design for computation-communication overlap.
☆161Updated this week
microsoft / SparTA
☆150Updated last year
tile-ai / tilescale
Tile-based language built for AI computation across all scales
☆48Updated this week
LLMServe / SwiftTransformer
High performance Transformer implementation in C++.
☆132Updated 7 months ago
apuaaChen / EVT_AE
Artifacts of EVT ASPLOS'24
☆26Updated last year
sunlex0717 / DissectingTensorCores
☆106Updated last year
SJTU-IPADS / reef
REEF is a GPU-accelerated DNN inference serving system that enables instant kernel preemption and biased concurrent execution in GPU sche…
☆100Updated 2 years ago
zhaiyi000 / tlm
☆42Updated last year
shenh10 / DeepSeek_Simulator
☆84Updated 5 months ago
ParCIS / Chimera
Chimera: bidirectional pipeline parallelism for efficiently training large-scale models.
☆66Updated 5 months ago
UMass-LIDS / Proteus
Proteus: A High-Throughput Inference-Serving System with Accuracy Scaling
☆13Updated last year
AlibabaResearch / mononn
☆28Updated last year
microsoft / taccl
TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches
☆75Updated 2 years ago
microsoft / msccl-tools
Synthesizer for optimal collective communication algorithms
☆116Updated last year
DD-DuDa / BitDecoding
A GPU-optimized system for efficient long-context LLMs decoding with low-bit KV cache.
☆58Updated last week
eth-easl / orion
An interference-aware scheduler for fine-grained GPU sharing
☆145Updated 7 months ago
Thesys-lab / Helix-ASPLOS25
Open-source implementation for "Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow"
☆63Updated 9 months ago
humuyan / Korch
ASPLOS'24: Optimal Kernel Orchestration for Tensor Programs with Korch
☆38Updated 5 months ago
eniac / paella
Paella: Low-latency Model Serving with Virtualized GPU Scheduling
☆62Updated last year
mcrl / tccl
Thunder Research Group's Collective Communication Library
☆41Updated last month
zhaiyi000 / tlp
☆41Updated last year
awslabs / optimizing-multitask-training-through-dynamic-pipelines
Official repository for the paper DynaPipe: Optimizing Multi-task Training through Dynamic Pipelines
☆20Updated last year
sitar-lab / NeuSight
☆51Updated 2 months ago