NVIDIA / nvidia-resiliency-ext
NVIDIA Resiliency Extension is a python package for framework developers and users to implement fault-tolerant features. It improves the effective training time by minimizing the downtime due to failures and interruptions.
☆151Updated this week
Alternatives and similar repositories for nvidia-resiliency-ext:
Users that are interested in nvidia-resiliency-ext are comparing it to the libraries listed below
- A library to analyze PyTorch traces.☆367Updated last week
- NVIDIA Inference Xfer Library (NIXL)☆304Updated this week
- ☆70Updated 4 months ago
- Applied AI experiments and examples for PyTorch☆262Updated last week
- NCCL Profiling Kit☆133Updated 10 months ago
- NCCL Fast Socket is a transport layer plugin to improve NCCL collective communication performance on Google Cloud.☆116Updated last year
- Perplexity GPU Kernels☆272Updated this week
- ☆49Updated last month
- ☆202Updated 9 months ago
- MSCCL++: A GPU-driven communication stack for scalable AI applications☆345Updated this week
- Microsoft Collective Communication Library☆344Updated last year
- NVIDIA NCCL Tests for Distributed Training☆88Updated last week
- PArametrized Recommendation and Ai Model benchmark is a repository for development of numerous uBenchmarks as well as end to end nets for…☆137Updated this week
- extensible collectives library in triton☆85Updated last month
- Experimental projects related to TensorRT☆99Updated this week
- ☆304Updated 8 months ago
- RDMA and SHARP plugins for nccl library☆191Updated 3 weeks ago
- Synthesizer for optimal collective communication algorithms☆106Updated last year
- Zero Bubble Pipeline Parallelism☆386Updated 3 weeks ago
- ☆104Updated last month
- Fast low-bit matmul kernels in Triton☆295Updated this week
- ☆202Updated last week
- Microsoft Collective Communication Library☆65Updated 5 months ago
- A performant, memory-efficient checkpointing library for PyTorch applications, designed with large, complex distributed workloads in mind…☆157Updated 4 months ago
- This repository contains the experimental PyTorch native float8 training UX☆224Updated 9 months ago
- A low-latency & high-throughput serving engine for LLMs☆351Updated 2 weeks ago
- ☆78Updated 5 months ago
- A tensor-aware point-to-point communication primitive for machine learning☆257Updated 2 years ago
- ☆79Updated 2 years ago
- QuickReduce is a performant all-reduce library designed for AMD ROCm that supports inline compression.☆24Updated last month