CUDA checkpoint and restore utility
☆434 Sep 15, 2025 Updated 6 months ago
Alternatives and similar repositories for cuda-checkpoint
Users who are interested in cuda-checkpoint are comparing it to the libraries listed below.
- cricket is a virtualization solution for GPUs ☆238 Sep 9, 2025 Updated 7 months ago
- Fast OS-level support for GPU checkpoint and restore ☆280 Sep 28, 2025 Updated 6 months ago
- A tool for coordinated checkpoint/restore of distributed applications with CRIU ☆32 Mar 2, 2026 Updated last month
- Hooked CUDA-related dynamic libraries by using automated code generation tools. ☆171 Dec 12, 2023 Updated 2 years ago
- Orchestrated process and container checkpointing ☆123 Updated this week
- ☆20 Jul 10, 2025 Updated 9 months ago
- HAMi-core compiles libvgpu.so, which enforces a hard limit on GPU usage in a container ☆291 Apr 3, 2026 Updated last week
- NCCL Profiling Kit ☆152 Jul 1, 2024 Updated last year
- Checkpoint/Restore tool ☆3,770 Apr 2, 2026 Updated last week
- NVIDIA DRA Driver for GPUs ☆619 Updated this week
- ☆47 Dec 13, 2024 Updated last year
- NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs ☆701 Mar 30, 2026 Updated last week
- Artifacts for our NSDI'23 paper TGS ☆97 Jun 10, 2024 Updated last year
- A tool for bandwidth measurements on NVIDIA GPUs. ☆659 Updated this week
- Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo) ☆491 Apr 3, 2026 Updated last week
- NVIDIA Resiliency Extension is a python package for framework developers and users to implement fault-tolerant features. It improves the … ☆283 Updated this week
- Checkpoint and Restore in Kubernetes ☆166 May 15, 2024 Updated last year
- A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology ☆1,360 Mar 12, 2026 Updated 3 weeks ago
- Go Bindings for the NVIDIA Management Library (NVML) ☆430 Feb 12, 2026 Updated last month
- GPUd automates monitoring, diagnostics, and issue identification for GPUs ☆479 Updated this week
- NCCL Tests ☆1,480 Mar 11, 2026 Updated 3 weeks ago
- DLRover: An Automatic Distributed Deep Learning System ☆1,641 Apr 2, 2026 Updated last week
- ☆539 Jun 7, 2024 Updated last year
- Optimized primitives for collective multi-GPU communication ☆10 May 8, 2024 Updated last year
- GLake: optimizing GPU memory management and IO transmission. ☆501 Mar 24, 2025 Updated last year
- Module, Model, and Tensor Serialization/Deserialization ☆297 Feb 6, 2026 Updated 2 months ago
- Go Bindings for CRIU ☆234 Mar 28, 2026 Updated 2 weeks ago
- Scripts for managing a large H100 cluster and fixing hardware issues to ensure smooth model training. ☆323 Aug 20, 2024 Updated last year
- NVIDIA Inference Xfer Library (NIXL) ☆970 Updated this week
- TiledLower is a Dataflow Analysis and Codegen Framework written in Rust. ☆13 Nov 23, 2024 Updated last year
- NVIDIA's launch, startup, and logging scripts used by our MLPerf Training and HPC submissions ☆39 Sep 12, 2025 Updated 6 months ago
- NVIDIA GPUDirect Storage Driver ☆340 Mar 18, 2026 Updated 3 weeks ago
- A Datacenter Scale Distributed Inference Serving Framework ☆6,527 Updated this week
- The NVIDIA Driver Manager is a Kubernetes component which assists in seamless upgrades of the NVIDIA driver on each node of the cluster. ☆52 Mar 30, 2026 Updated last week
- Microsoft Collective Communication Library ☆389 Sep 20, 2023 Updated 2 years ago
- RDMA and SHARP plugins for nccl library ☆225 Apr 3, 2026 Updated last week
- ☆294 Mar 19, 2026 Updated 3 weeks ago
- Efficient and easy multi-instance LLM serving ☆541 Mar 12, 2026 Updated 3 weeks ago
- Examples demonstrating available options to program multiple GPUs in a single node or a cluster ☆883 Sep 26, 2025 Updated 6 months ago