NUS-HPC-AI-Lab / oh-my-server

☆30

Alternatives and similar repositories for oh-my-server

Users who are interested in oh-my-server are comparing it to the libraries listed below.

hpcaitech / ColossalAI-Benchmark
Performance benchmarking with ColossalAI
☆38 Updated 3 years ago
Hsword / Hetu
A high-performance distributed deep learning system targeting large-scale and automated distributed training. If you have any interests, …
☆123 Updated last year
zhuohan123 / terapipe
☆77 Updated 4 years ago
LiuXiaoxuanPKU / GACT-ICML
☆43 Updated 3 years ago
fanshiqing / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆169 Updated last month
RulinShao / LightSeq
Official repository for DistFlashAttn: Distributed Memory-efficient Attention for Long-context LLMs Training
☆218 Updated last year
yanring / Megatron-MoE-ModelZoo
Best practices for training DeepSeek, Mixtral, Qwen and other MoE models using Megatron Core.
☆128 Updated 2 weeks ago
ByteDance-Seed / ByteCheckpoint
ByteCheckpoint: An Unified Checkpointing Library for LFMs
☆252 Updated 4 months ago
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆131 Updated 6 months ago
osayamenja / FlashMoE
Distributed MoE in a Single Kernel [NeurIPS '25]
☆145 Updated 2 months ago
NVlabs / COAT
[ICLR 2025] COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training
☆250 Updated 3 months ago
microsoft / nnscaler
nnScaler: Compiling DNN models for Parallel Training
☆120 Updated 2 months ago
ParCIS / Chimera
Chimera: bidirectional pipeline parallelism for efficiently training large-scale models.
☆68 Updated 8 months ago
thu-pacman / FasterMoE
☆88 Updated 3 years ago
Victarry / PP-Schedule-Visualization
Pipeline Parallelism Emulation and Visualization
☆71 Updated 5 months ago
mit-han-lab / flash-moba
☆187 Updated last week
Azure / MS-AMP-Examples
Examples for MS-AMP package.
☆30 Updated 4 months ago
uwsampl / dtr-prototype
Dynamic Tensor Rematerialization prototype (modified PyTorch) and simulator. Paper: https://arxiv.org/abs/2006.09616
☆132 Updated 2 years ago
alexzhang13 / flashattention2-custom-mask
Triton implementation of FlashAttention2 that adds Custom Masks.
☆151 Updated last year
thu-pacman / SmartMoE-AE
ATC23 AE
☆47 Updated 2 years ago
flagos-ai / FlagAttention
A collection of memory efficient attention operators implemented in the Triton language.
☆283 Updated last year
attention-survey / Efficient_Attention_Survey
A Survey of Efficient Attention Methods: Hardware-efficient, Sparse, Compact, and Linear Attention
☆233 Updated 3 months ago
cli99 / flops-profiler
pytorch-profiler
☆51 Updated 2 years ago
DachengLi1 / AMP
(NeurIPS 2022) Automatically finding good model-parallel strategies, especially for complex models and clusters.
☆43 Updated 3 years ago
feifeibear / Odysseus-Transformer
Odysseus: Playground of LLM Sequence Parallelism
☆78 Updated last year
meta-pytorch / KernelAgent
Autonomous GPU Kernel Generation via Deep Agents
☆163 Updated last week
stanford-futuredata / stk
☆113 Updated last year
UofT-EcoSystem / Tempo
Memory footprint reduction for transformer models
☆11 Updated 2 years ago
Youhe-Jiang / IJCAI2023-OptimalShardedDataParallel
[IJCAI2023] An automated parallel training system that combines the advantages from both data and model parallelism. If you have any inte…
☆52 Updated 2 years ago
HPDL-Group / Merak
☆80 Updated 6 months ago