leptonai / gpud
GPUd automates monitoring, diagnostics, and issue identification for GPUs
☆475 · Feb 6, 2026 · Updated last week
Alternatives and similar repositories for gpud
Users who are interested in gpud are comparing it to the repositories listed below.
- ☆323 · Aug 20, 2024 · Updated last year
- knavigator is a development, testing, and optimization toolkit for AI/ML scheduling systems at scale on Kubernetes. ☆74 · Jul 18, 2025 · Updated 6 months ago
- NVIDIA NCCL Tests for Distributed Training ☆136 · Jan 27, 2026 · Updated 2 weeks ago
- A tool to detect infrastructure issues on cloud native AI systems ☆52 · Sep 18, 2025 · Updated 4 months ago
- KAI Scheduler is an open source Kubernetes Native scheduler for AI workloads at large scale ☆1,127 · Updated this week
- CUDA checkpoint and restore utility ☆415 · Sep 15, 2025 · Updated 4 months ago
- A Datacenter Scale Distributed Inference Serving Framework ☆6,052 · Updated this week
- NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes ☆2,533 · Updated this week
- Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo) ☆476 · Feb 3, 2026 · Updated last week
- NVIDIA DRA Driver for GPUs ☆557 · Feb 6, 2026 · Updated last week
- ☆193 · Jan 20, 2026 · Updated 3 weeks ago
- 🧯 Kubernetes coverage for fault awareness and recovery, works for any LLMOps, MLOps, AI workloads. ☆35 · Feb 5, 2026 · Updated last week
- Go Bindings for the NVIDIA Management Library (NVML) ☆424 · Feb 5, 2026 · Updated last week
- Kubernetes-native Job Queueing ☆2,313 · Updated this week
- Heterogeneous AI Computing Virtualization Middleware (Project under CNCF) ☆3,005 · Updated this week
- Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from different components in the system like the… ☆362 · Jan 28, 2026 · Updated 2 weeks ago
- ☆335 · Updated this week
- NCCL Tests ☆1,427 · Updated this week
- Fluid, elastic data abstraction and acceleration for BigData/AI applications in cloud. (Project under CNCF) ☆1,890 · Updated this week
- Resource Exporter for volcano scheduling, e.g. NUMA-Aware scheduling. ☆19 · May 30, 2025 · Updated 8 months ago
- A production-ready remote container image format (overlaybd) and snapshotter based on block-device. ☆449 · Updated this week
- NVIDIA Resiliency Extension is a python package for framework developers and users to implement fault-tolerant features. It improves the … ☆262 · Feb 7, 2026 · Updated last week
- A QoS-based scheduling system brings optimal layout and status to workloads such as microservices, web services, big data jobs, AI jobs, … ☆1,652 · Feb 5, 2026 · Updated last week
- Kubernetes Operator for AI and Bigdata Elastic Training ☆90 · Jan 10, 2025 · Updated last year
- Repository for out-of-tree scheduler plugins based on scheduler framework. ☆1,271 · Dec 5, 2025 · Updated 2 months ago
- Katalyst aims to provide a universal solution to help improve resource utilization and optimize the overall costs in the cloud. This is t… ☆540 · Updated this week
- Cost-efficient and pluggable Infrastructure components for GenAI inference ☆4,603 · Feb 5, 2026 · Updated last week
- Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. ☆4,701 · Updated this week
- InfiniBand fabric monitoring daemon written in Go ☆32 · May 22, 2025 · Updated 8 months ago
- NVIDIA GPU metrics exporter for Prometheus leveraging DCGM ☆1,609 · Feb 4, 2026 · Updated last week
- DLRover: An Automatic Distributed Deep Learning System ☆1,633 · Updated this week
- A Cloud Native Batch System (Project under CNCF) ☆5,320 · Updated this week
- Gateway API Inference Extension ☆583 · Updated this week
- ☆215 · Updated this week
- CloudAI Benchmark Framework ☆83 · Feb 6, 2026 · Updated last week
- Health checks for Azure N- and H-series VMs. ☆57 · Feb 5, 2026 · Updated last week
- NVIDIA device plugin for Kubernetes ☆3,662 · Updated this week
- A toolkit to run Ray applications on Kubernetes ☆2,319 · Updated this week
- A CPU+GPU Profiling library that provides access to timeline traces and hardware performance counters. ☆922 · Updated this week