NVIDIA / nvidia-resiliency-ext

NVIDIA Resiliency Extension is a Python package for framework developers and users to implement fault-tolerant features. It improves effective training time by minimizing downtime due to failures and interruptions.

☆226

Alternatives and similar repositories for nvidia-resiliency-ext

Users who are interested in nvidia-resiliency-ext are comparing it to the libraries listed below.

facebookresearch / HolisticTraceAnalysis
A library to analyze PyTorch traces.
☆416 Updated last week
google / nccl-fastsocket
NCCL Fast Socket is a transport layer plugin to improve NCCL collective communication performance on Google Cloud.
☆121 Updated last year
ai-dynamo / nixl
NVIDIA Inference Xfer Library (NIXL)
☆673 Updated this week
yifuwang / symm-mem-recipes
☆141 Updated 9 months ago
ai-dynamo / aiconfigurator
Offline optimization of your disaggregated Dynamo graph
☆79 Updated this week
perplexityai / pplx-kernels
Perplexity GPU Kernels
☆497 Updated last month
microsoft / mscclpp
MSCCL++: A GPU-driven communication stack for scalable AI applications
☆425 Updated this week
facebookresearch / param
PArametrized Recommendation and Ai Model benchmark is a repository for development of numerous uBenchmarks as well as end to end nets for…
☆153 Updated last week
uccl-project / uccl
Ultra and Unified CCL
☆595 Updated this week
microsoft / NPKit
NCCL Profiling Kit
☆145 Updated last year
meta-pytorch / applied-ai
Applied AI experiments and examples for PyTorch
☆299 Updated 2 months ago
microsoft / msccl
Microsoft Collective Communication Library
☆367 Updated 2 years ago
microsoft / sarathi-serve
A low-latency & high-throughput serving engine for LLMs
☆431 Updated this week
coreweave / nccl-tests
NVIDIA NCCL Tests for Distributed Training
☆114 Updated last week
NVIDIA / nvbandwidth
A tool for bandwidth measurements on NVIDIA GPUs.
☆550 Updated 6 months ago
sail-sg / zero-bubble-pipeline-parallelism
Zero Bubble Pipeline Parallelism
☆432 Updated 5 months ago
NVIDIA / cuda-checkpoint
CUDA checkpoint and restore utility
☆376 Updated last month
mk1-project / quickreduce
QuickReduce is a performant all-reduce library designed for AMD ROCm that supports inline compression.
☆33 Updated last month
perplexityai / libfabric-efa-demo
☆71 Updated 8 months ago
sgl-project / genai-bench
Genai-bench is a powerful benchmark tool designed for comprehensive token-level performance evaluation of large language model (LLM) serv…
☆220 Updated this week
Mellanox / nccl-rdma-sharp-plugins
RDMA and SHARP plugins for the NCCL library
☆209 Updated last month
microsoft / vattention
Dynamic Memory Management for Serving LLMs without PagedAttention
☆426 Updated 4 months ago
Azure / msccl
Microsoft Collective Communication Library
☆66 Updated 11 months ago
ColfaxResearch / cutlass-kernels
☆240 Updated last year
Deep-Learning-Profiling-Tools / triton-viz
☆240 Updated this week
imbue-ai / cluster-health
☆316 Updated last year
aws / aws-ofi-nccl
A plugin that lets EC2 developers use libfabric as a network provider while running NCCL applications.
☆186 Updated this week
AlibabaPAI / llumnix
Efficient and easy multi-instance LLM serving
☆497 Updated last month
cchan / tccl
Extensible collectives library in Triton
☆89 Updated 6 months ago
foundation-model-stack / foundation-model-stack
🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components.
☆215 Updated this week