NVIDIA / nvidia-resiliency-ext
NVIDIA Resiliency Extension is a Python package for framework developers and users to implement fault-tolerant features. It improves effective training time by minimizing downtime caused by failures and interruptions.
☆90 · Updated last week
Alternatives and similar repositories for nvidia-resiliency-ext:
Users who are interested in nvidia-resiliency-ext are comparing it to the libraries listed below.
- Applied AI experiments and examples for PyTorch ☆223 · Updated this week
- A library to analyze PyTorch traces. ☆331 · Updated this week
- A performant, memory-efficient checkpointing library for PyTorch applications, designed with large, complex distributed workloads in mind… ☆154 · Updated 2 months ago
- Extensible collectives library in Triton ☆82 · Updated 4 months ago
- Zero Bubble Pipeline Parallelism ☆334 · Updated this week
- MSCCL++: A GPU-driven communication stack for scalable AI applications ☆294 · Updated this week
- ☆42 · Updated last month
- NCCL Fast Socket is a transport layer plugin to improve NCCL collective communication performance on Google Cloud. ☆115 · Updated last year
- ☆180 · Updated 7 months ago
- NCCL Profiling Kit ☆127 · Updated 7 months ago
- ☆67 · Updated 3 months ago
- Microsoft Collective Communication Library ☆332 · Updated last year
- This repository contains the experimental PyTorch native float8 training UX ☆221 · Updated 6 months ago
- PArametrized Recommendation and Ai Model benchmark is a repository for development of numerous uBenchmarks as well as end to end nets for… ☆128 · Updated this week
- 🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components. ☆187 · Updated this week
- Fast low-bit matmul kernels in Triton ☆231 · Updated this week
- A fast communication-overlapping library for tensor parallelism on GPUs. ☆295 · Updated 3 months ago
- A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser") ☆302 · Updated this week
- Synthesizer for optimal collective communication algorithms ☆103 · Updated 10 months ago
- ☆175 · Updated this week
- Experimental projects related to TensorRT ☆89 · Updated this week
- CUDA checkpoint and restore utility ☆288 · Updated 2 weeks ago
- ☆74 · Updated 2 years ago
- Efficient and easy multi-instance LLM serving ☆288 · Updated this week
- RDMA and SHARP plugins for the NCCL library ☆175 · Updated 3 weeks ago
- ☆70 · Updated 3 years ago
- Python bindings for NVTX ☆66 · Updated last year
- Microsoft Collective Communication Library ☆61 · Updated 2 months ago
- ☆141 · Updated 2 weeks ago
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆86 · Updated this week