AlibabaPAI / FlashModels

Fast and easy distributed model training examples.

☆13

Alternatives and similar repositories for FlashModels

Users who are interested in FlashModels are comparing it to the repositories listed below.

AlibabaPAI / torchacc
PyTorch distributed training acceleration framework
☆51 · Updated 5 months ago
ColfaxResearch / cfx-article-src
☆129 · Updated 3 months ago
alibaba / TePDist
TePDist (TEnsor Program DISTributed) is an HLO-level automatic distributed system for DL models.
☆94 · Updated 2 years ago
ColfaxResearch / cutlass-kernels
☆228 · Updated last year
DD-DuDa / Cute-Learning
Examples of CUDA implementations by Cutlass CuTe
☆218 · Updated last month
ppl-ai / pplx-kernels
Perplexity GPU Kernels
☆418 · Updated 3 weeks ago
reed-lau / cute-gemm
☆128 · Updated 8 months ago
kwai / Megatron-Kwai
[USENIX ATC '24] Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Paral…
☆61 · Updated last year
ConnollyLeon / awesome-Auto-Parallelism
A baseline repository of Auto-Parallelism in Training Neural Networks
☆144 · Updated 3 years ago
fanshiqing / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆134 · Updated 3 weeks ago
yifuwang / symm-mem-recipes
☆102 · Updated 7 months ago
OpenPPL / ppl.llm.kernel.cuda
☆149 · Updated 7 months ago
microsoft / vattention
Dynamic Memory Management for Serving LLMs without PagedAttention
☆405 · Updated 2 months ago
KnowingNothing / MatmulTutorial
An Easy-to-understand TensorOp Matmul Tutorial
☆370 · Updated 10 months ago
Cambricon / triton-linalg
Development repository for the Triton-Linalg conversion
☆190 · Updated 6 months ago
NVIDIA-Merlin / HierarchicalKV
HierarchicalKV is a part of NVIDIA Merlin and provides hierarchical key-value storage to meet RecSys requirements. The key capability of…
☆163 · Updated this week
OpenPPL / ppl.nn.llm
☆139 · Updated last year
CalebDu / Awesome-Cute
☆91 · Updated 2 months ago
microsoft / triton-shared
Shared Middle-Layer for Triton Compilation
☆261 · Updated last week
DeepLink-org / DIOPI
☆72 · Updated 8 months ago
AlibabaPAI / FLASHNN
☆96 · Updated 11 months ago
facebookexperimental / triton
GitHub mirror of the triton-lang/triton repo.
☆50 · Updated this week
sail-sg / zero-bubble-pipeline-parallelism
Zero Bubble Pipeline Parallelism
☆415 · Updated 3 months ago
Yinghan-Li / YHs_Sample
Yinghan's Code Sample
☆341 · Updated 3 years ago
microsoft / nnscaler
nnScaler: Compiling DNN models for Parallel Training
☆114 · Updated this week
Shenggan / awesome-distributed-ml
A curated list of awesome projects and papers for distributed training or inference
☆241 · Updated 10 months ago
microsoft / mscclpp
MSCCL++: A GPU-driven communication stack for scalable AI applications
☆394 · Updated this week
ByteDance-Seed / Triton-distributed
Distributed Compiler based on Triton for Parallel Systems
☆941 · Updated this week
bytedance / ByteMLPerf
AI Accelerator Benchmark focuses on evaluating AI Accelerators from a practical production perspective, including the ease of use and ver…
☆256 · Updated this week
infinigence / FlashOverlap
A lightweight design for computation-communication overlap.
☆155 · Updated last month