alibaba / TePDist
TePDist (TEnsor Program DISTributed) is an HLO-level automatic distributed system for DL models.
☆99 · Apr 22, 2023 · Updated 2 years ago
Alternatives and similar repositories for TePDist
Users interested in TePDist are comparing it to the libraries listed below.
- Automated Parallelization System and Infrastructure for Multiple Ecosystems ☆82 · Nov 19, 2024 · Updated last year
- BladeDISC is an end-to-end DynamIc Shape Compiler project for machine learning workloads. ☆916 · Dec 30, 2024 · Updated last year
- A high-performance framework for training wide-and-deep recommender systems on heterogeneous clusters ☆161 · Apr 20, 2024 · Updated last year
- PyTorch distributed training acceleration framework ☆55 · Aug 13, 2025 · Updated 6 months ago
- Multiple GEMM operators built with CUTLASS to support LLM inference. ☆20 · Aug 3, 2025 · Updated 6 months ago
- Custom recipes for post-collection analysis with NVIDIA Nsight Systems. ☆16 · Nov 7, 2025 · Updated 3 months ago
- Easy Parallel Library (EPL) is a general and efficient deep learning framework for distributed model training. ☆271 · Mar 31, 2023 · Updated 2 years ago
- A model compilation solution for various hardware ☆464 · Aug 20, 2025 · Updated 5 months ago
- A baseline repository of Auto-Parallelism in Training Neural Networks ☆147 · Jun 25, 2022 · Updated 3 years ago
- HierarchicalKV is a part of NVIDIA Merlin and provides hierarchical key-value storage to meet RecSys requirements. The key capability of… ☆192 · Updated this week
- DeepRec is a high-performance recommendation deep learning framework based on TensorFlow. It is hosted in incubation in LF AI & Data Foun… ☆1,165 · Jan 21, 2025 · Updated last year
- ☆12 · Mar 13, 2023 · Updated 2 years ago
- Cavs: An Efficient Runtime System for Dynamic Neural Networks ☆15 · Sep 18, 2020 · Updated 5 years ago
- Optimized BERT transformer inference on NVIDIA GPUs. https://arxiv.org/abs/2210.03052 ☆477 · Mar 15, 2024 · Updated last year
- NCCL Profiling Kit ☆152 · Jul 1, 2024 · Updated last year
- A lightweight, Pythonic frontend for MLIR ☆80 · Oct 21, 2023 · Updated 2 years ago
- System for automated integration of deep learning backends. ☆47 · Aug 15, 2022 · Updated 3 years ago
- A high-performance distributed deep learning system targeting large-scale and automated distributed training. ☆333 · Dec 13, 2025 · Updated 2 months ago
- MSCCL++: A GPU-driven communication stack for scalable AI applications ☆462 · Updated this week
- nnScaler: Compiling DNN models for Parallel Training ☆124 · Sep 23, 2025 · Updated 4 months ago
- ☆422 · Jan 4, 2026 · Updated last month
- DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling ☆22 · Updated this week
- A tutorial on distributed DRL with Ray and TensorFlow. ☆10 · Dec 26, 2019 · Updated 6 years ago
- Yinghan's Code Sample ☆365 · Jul 25, 2022 · Updated 3 years ago
- ☆84 · Feb 6, 2026 · Updated last week
- ☆219 · Aug 17, 2023 · Updated 2 years ago
- A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer ☆96 · Sep 13, 2025 · Updated 5 months ago
- An MLIR-based compiler framework that bridges DSLs (domain-specific languages) to DSAs (domain-specific architectures). ☆694 · Feb 2, 2026 · Updated last week
- An external memory allocator example for PyTorch. ☆16 · Aug 10, 2025 · Updated 6 months ago
- A flexible and efficient training framework for large-scale alignment tasks ☆449 · Oct 23, 2025 · Updated 3 months ago
- A collection of compiler learning resources. ☆2,679 · Mar 19, 2025 · Updated 10 months ago
- Experiments comparing the throughput of the prefill and decoding stages of LLM inference, revealing the performance bottleneck and explaining the rationale behind prefill/decode (PD) disaggregation. Includes test scripts for CUDA and Apple MPS (M-series chips). ☆20 · May 22, 2025 · Updated 8 months ago
- Large-scale exact string matching tool ☆17 · Mar 7, 2025 · Updated 11 months ago
- A schedule language for large model training ☆152 · Aug 21, 2025 · Updated 5 months ago
- ☆158 · Dec 26, 2024 · Updated last year
- Zero Bubble Pipeline Parallelism ☆449 · May 7, 2025 · Updated 9 months ago
- Standalone Flash Attention v2 kernel without libtorch dependency ☆114 · Sep 10, 2024 · Updated last year
- Parallel selection on GPUs ☆15 · Mar 23, 2021 · Updated 4 years ago
- Accelerate LLM preference tuning via prefix sharing with a single line of code ☆51 · Jul 4, 2025 · Updated 7 months ago