GPUd automates monitoring, diagnostics, and issue identification for GPUs
☆482 · May 25, 2026 · Updated this week
Alternatives and similar repositories for gpud
Users who are interested in gpud are comparing it to the libraries listed below.
- Scripts for managing a large H100 cluster and fixing hardware issues to ensure smooth model training. ☆325 · Aug 20, 2024 · Updated last year
- knavigator is a development, testing, and optimization toolkit for AI/ML scheduling systems at scale on Kubernetes. ☆78 · Apr 14, 2026 · Updated last month
- A tool to detect infrastructure issues on cloud native AI systems ☆53 · Sep 18, 2025 · Updated 8 months ago
- Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo) ☆505 · Apr 3, 2026 · Updated last month
- CUDA checkpoint and restore utility ☆451 · Sep 15, 2025 · Updated 8 months ago
- NVSentinel is a cross-platform fault remediation service designed to rapidly remediate runtime node-level issues in GPU-accelerated compu… ☆287 · May 21, 2026 · Updated last week
- A Datacenter Scale Distributed Inference Serving Framework ☆6,941 · May 22, 2026 · Updated last week
- NVIDIA NCCL Tests for Distributed Training ☆144 · May 23, 2026 · Updated last week
- Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from different components in the system like the… ☆373 · May 21, 2026 · Updated last week
- KAI Scheduler is an open source Kubernetes Native scheduler for AI workloads at large scale ☆1,279 · Updated this week
- NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes ☆2,715 · Updated this week
- DRA Driver for NVIDIA GPUs ☆651 · Updated this week
- Kubernetes-native Job Queueing ☆2,524 · Updated this week
- NVIDIA Resiliency Extension is a python package for framework developers and users to implement fault-tolerant features. It improves the … ☆291 · May 21, 2026 · Updated last week
- ☆202 · May 8, 2026 · Updated 3 weeks ago
- ☆358 · May 20, 2026 · Updated last week
- Go Bindings for the NVIDIA Management Library (NVML) ☆441 · May 14, 2026 · Updated 2 weeks ago
- NCCL Tests ☆1,529 · May 20, 2026 · Updated last week
- A production-ready remote container image format (overlaybd) and snapshotter based on block-device. ☆462 · Updated this week
- Resource Exporter for volcano scheduling, e.g. NUMA-Aware scheduling. ☆19 · May 30, 2025 · Updated last year
- NVIDIA device plugin for Kubernetes ☆15 · Sep 9, 2019 · Updated 6 years ago
- A QoS-based scheduling system brings optimal layout and status to workloads such as microservices, web services, big data jobs, AI jobs, … ☆1,692 · May 22, 2026 · Updated last week
- Repository for out-of-tree scheduler plugins based on scheduler framework. ☆1,293 · May 11, 2026 · Updated 2 weeks ago
- 🧯 Kubernetes coverage for fault awareness and recovery, works for any LLMOps, MLOps, AI workloads. ☆35 · Updated this week
- NVIDIA GPU metrics exporter for Prometheus leveraging DCGM ☆1,737 · May 12, 2026 · Updated 2 weeks ago
- Kubernetes Operator for AI and Bigdata Elastic Training ☆91 · Jan 10, 2025 · Updated last year
- Health checks for Azure N- and H-series VMs. ☆57 · May 13, 2026 · Updated 2 weeks ago
- Heterogeneous GPU Sharing on Kubernetes ☆3,475 · May 22, 2026 · Updated last week
- ☆14 · Jul 23, 2018 · Updated 7 years ago
- Cost-efficient and pluggable Infrastructure components for GenAI inference ☆4,824 · May 21, 2026 · Updated last week
- Katalyst aims to provide a universal solution to help improve resource utilization and optimize the overall costs in the cloud. This is t… ☆552 · May 22, 2026 · Updated last week
- CloudAI Benchmark Framework ☆96 · May 21, 2026 · Updated last week
- Fluid, elastic data abstraction and acceleration for BigData/AI applications in cloud. (Project under CNCF) ☆1,925 · Updated this week
- A Cloud Native Batch System (Project under CNCF) ☆5,594 · Updated this week
- This is a place for various problem detectors running on the Kubernetes nodes. ☆3,410 · Updated this week
- A CPU+GPU Profiling library that provides access to timeline traces and hardware performance counters. ☆953 · May 20, 2026 · Updated last week
- DLRover: An Automatic Distributed Deep Learning System ☆1,657 · Updated this week
- a unified scheduler for online and offline tasks ☆669 · Mar 2, 2026 · Updated 2 months ago
- ☆259 · Updated this week