meta-pytorch/torchft

Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo)

☆478

Alternatives and similar repositories for torchft

Users who are interested in torchft are comparing it to the libraries listed below.

NVIDIA / nvidia-resiliency-ext
NVIDIA Resiliency Extension is a python package for framework developers and users to implement fault-tolerant features. It improves the …
☆264 · Updated this week
pytorch / torchtitan
A PyTorch native platform for training generative AI models
☆5,098 · Updated this week
foundation-model-stack / fms-fsdp
🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash…
☆280 · Nov 24, 2025 · Updated 3 months ago
lianakoleva / no-libtorch-compile
☆21 · Mar 3, 2025 · Updated 11 months ago
pytorch / ao
PyTorch native quantization and sparsity for training and inference
☆2,707 · Updated this week
meta-pytorch / torchsnapshot
A performant, memory-efficient checkpointing library for PyTorch applications, designed with large, complex distributed workloads in mind…
☆164 · Jan 12, 2026 · Updated last month
volcengine / veScale
Byted PyTorch Distributed for Hyperscale Training of LLMs and RLs
☆938 · Nov 27, 2025 · Updated 3 months ago
facebookresearch / HolisticTraceAnalysis
A library to analyze PyTorch traces.
☆467 · Feb 4, 2026 · Updated 3 weeks ago
huggingface / nanotron
Minimalistic large language model 3D-parallelism training
☆2,579 · Feb 19, 2026 · Updated last week
meta-pytorch / attention-gym
Helpful tools and examples for working with flex-attention
☆1,136 · Feb 8, 2026 · Updated 3 weeks ago
meta-pytorch / monarch
PyTorch Single Controller
☆981 · Updated this week
GindaChen / FlexFlashAttention3
FlexAttention w/ FlashAttention3 Support
☆27 · Oct 5, 2024 · Updated last year
meta-pytorch / tritonbench
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
☆327 · Updated this week
mgmalek / efficient_cross_entropy
☆124 · May 28, 2024 · Updated last year
meta-pytorch / torchx
TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and sup…
☆417 · Updated this week
PrimeIntellect-ai / pccl
PCCL (Prime Collective Communications Library) implements fault tolerant collective communications over IP
☆143 · Sep 12, 2025 · Updated 5 months ago
cchan / tccl
extensible collectives library in triton
☆95 · Mar 31, 2025 · Updated 11 months ago
NVIDIA / TransformerEngine
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit floating point (FP8 and FP4) precision on H…
☆3,176 · Updated this week
dropbox / gemlite
Fast low-bit matmul kernels in Triton
☆433 · Feb 1, 2026 · Updated last month
BobMcDear / attorch
A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton.
☆595 · Aug 12, 2025 · Updated 6 months ago
meta-pytorch / float8_experimental
This repository contains the experimental PyTorch native float8 training UX
☆226 · Aug 1, 2024 · Updated last year
pytorch / kineto
A CPU+GPU Profiling library that provides access to timeline traces and hardware performance counters.
☆922 · Updated this week
haileyschoelkopf / triton-index
See https://github.com/cuda-mode/triton-index/ instead!
☆11 · May 8, 2024 · Updated last year
ByteDance-Seed / Triton-distributed
Distributed Compiler based on Triton for Parallel Systems
☆1,371 · Feb 13, 2026 · Updated 2 weeks ago
NVIDIA / cuda-checkpoint
CUDA checkpoint and restore utility
☆424 · Sep 15, 2025 · Updated 5 months ago
pytorch / torchdistx
Torch Distributed Experimental
☆117 · Aug 5, 2024 · Updated last year
zhuzilin / ring-flash-attention
Ring attention implementation with flash attention
☆986 · Sep 10, 2025 · Updated 5 months ago
meta-pytorch / torchfix
TorchFix - a linter for PyTorch-using code with autofix support
☆152 · Aug 23, 2025 · Updated 6 months ago
facebookresearch / spdl
Scalable and Performant Data Loading
☆368 · Updated this week
sail-sg / zero-bubble-pipeline-parallelism
Zero Bubble Pipeline Parallelism
☆451 · May 7, 2025 · Updated 9 months ago
pytorch / PiPPy
Pipeline Parallelism for PyTorch
☆785 · Aug 21, 2024 · Updated last year
meta-pytorch / autoparallel
An experimental implementation of compiler-driven automatic sharding of models across a given device mesh.
☆52 · Updated this week
IBM / triton-dejavu
Framework to reduce autotune overhead to zero for well known deployments.
☆97 · Sep 19, 2025 · Updated 5 months ago
microsoft / TileFusion
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
☆106 · Jun 28, 2025 · Updated 8 months ago
linkedin / Liger-Kernel
Efficient Triton Kernels for LLM Training
☆6,162 · Updated this week
meta-pytorch / tlparse
TORCH_TRACE parser for PT2
☆78 · Updated this week
imbue-ai / cluster-health
☆323 · Aug 20, 2024 · Updated last year
HazyResearch / ThunderKittens
Tile primitives for speedy kernels
☆3,183 · Updated this week
apple / ml-cross-entropy
☆580 · Sep 23, 2025 · Updated 5 months ago