NVIDIA/gpu-monitoring-tools

Tools for monitoring NVIDIA GPUs on Linux

☆1,075

Alternatives and similar repositories for gpu-monitoring-tools

Users that are interested in gpu-monitoring-tools are comparing it to the libraries listed below.

AliyunContainerService / gpushare-scheduler-extender
GPU Sharing Scheduler for Kubernetes Cluster
☆1,535 · Dec 29, 2023 · Updated 2 years ago
AliyunContainerService / gpushare-device-plugin
GPU Sharing Device Plugin for Kubernetes Cluster
☆496 · Jan 10, 2023 · Updated 3 years ago
NVIDIA / k8s-device-plugin
NVIDIA device plugin for Kubernetes
☆3,820 · Updated this week
NVIDIA / gpu-feature-discovery
GPU plugin to the node feature discovery for Kubernetes
☆309 · May 27, 2024 · Updated 2 years ago
NVIDIA / nvidia-container-runtime
NVIDIA container runtime
☆1,127 · Oct 27, 2023 · Updated 2 years ago
NVIDIA / gpu-operator
NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
☆2,795 · Updated this week
tkestack / gpu-manager
☆903 · Apr 2, 2024 · Updated 2 years ago
NVIDIA / go-nvml
Go Bindings for the NVIDIA Management Library (NVML)
☆447 · Updated this week
tkestack / vcuda-controller
☆544 · Jun 7, 2024 · Updated 2 years ago
NVIDIA / dcgm-exporter
NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
☆1,808 · Updated this week
NVIDIA / libnvidia-container
NVIDIA container runtime library
☆1,117 · Updated this week
NVIDIA / deepops
Tools for building GPU clusters
☆1,462 · Updated this week
NVIDIA / DCGM
NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
☆762 · Jul 6, 2026 · Updated last week
mindprince / nvidia_gpu_prometheus_exporter
NVIDIA GPU Prometheus Exporter
☆253 · Jul 15, 2021 · Updated 5 years ago
volcano-sh / volcano
A Cloud Native Batch System (Project under CNCF)
☆5,785 · Updated this week
Mellanox / k8s-rdma-sriov-dev-plugin
Kubernetes Rdma SRIOV device plugin
☆114 · Dec 30, 2020 · Updated 5 years ago
NVIDIA / go-dcgm
Golang bindings for Nvidia Datacenter GPU Manager (DCGM)
☆156 · Jul 8, 2026 · Updated last week
kubernetes-retired / kube-batch
A batch scheduler of kubernetes for high performance workload, e.g. AI/ML, BigData, HPC
☆1,089 · May 22, 2023 · Updated 3 years ago
tkestack / gpu-admission
☆131 · Apr 19, 2021 · Updated 5 years ago
NVIDIA / nvidia-docker
Build and run Docker containers leveraging NVIDIA GPUs
☆17,581 · Dec 6, 2023 · Updated 2 years ago
microsoft / hivedscheduler
Kubernetes Scheduler for Deep Learning
☆263 · May 22, 2022 · Updated 4 years ago
BugRoger / nvidia-exporter
Prometheus Exporter for NVIDIA GPUs using NVML
☆80 · Jun 27, 2020 · Updated 6 years ago
kubedl-io / kubedl
Run your deep learning workloads on Kubernetes more easily and efficiently.
☆532 · Mar 4, 2024 · Updated 2 years ago
kubernetes-sigs / node-feature-discovery
Node feature discovery for Kubernetes
☆1,055 · Updated this week
pokerfaceSad / GPUMounter
A kubernetes plugin which enables dynamically add or remove GPU resources for a running Pod
☆127 · Feb 23, 2022 · Updated 4 years ago
hustcat / k8s-rdma-device-plugin
RDMA device plugin for Kubernetes
☆226 · Dec 15, 2023 · Updated 2 years ago
kubernetes-sigs / scheduler-plugins
Repository for out-of-tree scheduler plugins based on scheduler framework.
☆1,303 · Jul 9, 2026 · Updated last week
kubeflow / arena
A CLI for Kubeflow.
☆815 · Updated this week
NTHU-LSALAB / KubeShare
Share GPU between Pods in Kubernetes
☆217 · Feb 6, 2023 · Updated 3 years ago
NVIDIA / go-gpuallocator
Go Abstraction for Allocating NVIDIA GPUs with Custom Policies
☆123 · Updated this week
kubeflow / trainer
Distributed AI Model Training and LLM Fine-Tuning on Kubernetes
☆2,152 · Updated this week
EBD-CREST / mrCUDA
An extension of rCUDA that enables remote-to-local GPU migration
☆41 · Sep 28, 2016 · Updated 9 years ago
NVIDIA / nccl
Optimized primitives for collective multi-GPU communication
☆4,892 · Updated this week
kubeflow / mpi-operator
Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
☆530 · Jul 13, 2026 · Updated last week
AliyunContainerService / et-operator
Kubernetes Operator for AI and Bigdata Elastic Training
☆91 · Jan 10, 2025 · Updated last year
virtaitech / orion
☆278 · Jul 6, 2023 · Updated 3 years ago
kubernetes / node-problem-detector
This is a place for various problem detectors running on the Kubernetes nodes.
☆3,432 · Updated this week
microsoft / frameworkcontroller
General-Purpose Kubernetes Pod Controller
☆170 · Apr 4, 2023 · Updated 3 years ago
awslabs / aws-virtual-gpu-device-plugin
AWS virtual gpu device plugin provides capability to use smaller virtual gpus for your machine learning inference workloads
☆203 · Nov 22, 2023 · Updated 2 years ago