NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs.

☆551

Alternatives and similar repositories for DCGM

Users who are interested in DCGM often compare it to the repositories listed below.

coreweave / nccl-tests
NVIDIA NCCL Tests for Distributed Training
☆100 Updated last week
NVIDIA / mig-parted
MIG Partition Editor for NVIDIA GPUs
☆204 Updated last week
NVIDIA / nvbandwidth
A tool for bandwidth measurements on NVIDIA GPUs.
☆492 Updated 3 months ago
Mellanox / nv_peer_memory
☆361 Updated last year
microsoft / superbenchmark
A validation and profiling tool for AI infrastructure
☆325 Updated last week
ai-dynamo / nixl
NVIDIA Inference Xfer Library (NIXL)
☆491 Updated this week
NVIDIA / nccl-tests
NCCL Tests
☆1,199 Updated last week
NVIDIA / cuda-checkpoint
CUDA checkpoint and restore utility
☆353 Updated 6 months ago
Mellanox / nccl-rdma-sharp-plugins
RDMA and SHARP plugins for nccl library
☆199 Updated last month
Bruce-Lee-LY / cuda_hook
Hooked CUDA-related dynamic libraries by using automated code generation tools.
☆160 Updated last year
leptonai / gpud
GPUd automates monitoring, diagnostics, and issue identification for GPUs
☆401 Updated this week
google / nccl-fastsocket
NCCL Fast Socket is a transport layer plugin to improve NCCL collective communication performance on Google Cloud.
☆118 Updated last year
Mellanox / k8s-rdma-shared-dev-plugin
☆283 Updated last week
tkestack / vcuda-controller
☆533 Updated last year
NTHU-LSALAB / Gemini
An efficient GPU resource sharing system with fine-grained control for Linux platforms.
☆84 Updated last year
NVIDIA / go-dcgm
Golang bindings for Nvidia Datacenter GPU Manager (DCGM)
☆123 Updated this week
NVIDIA / cloud-native-stack
Run cloud native workloads on NVIDIA GPUs
☆188 Updated this week
kubeflow / mpi-operator
Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
☆487 Updated last week
RWTH-ACS / cricket
cricket is a virtualization solution for GPUs
☆211 Updated last month
NTHU-LSALAB / KubeShare
Share GPU between Pods in Kubernetes
☆211 Updated 2 years ago
alibaba / GPU-scheduler-for-deep-learning
GPU-scheduler-for-deep-learning
☆210 Updated 4 years ago
triton-inference-server / triton_distributed
☆52 Updated 4 months ago
NVIDIA / dcgm-exporter
NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
☆1,302 Updated last week
aws / aws-ofi-nccl
This is a plugin which lets EC2 developers use libfabric as network provider while running NCCL applications.
☆181 Updated last week
NVIDIA / nvidia-resiliency-ext
NVIDIA Resiliency Extension is a python package for framework developers and users to implement fault-tolerant features. It improves the …
☆194 Updated last week
microsoft / hivedscheduler
Kubernetes Scheduler for Deep Learning
☆263 Updated 3 years ago
NVIDIA / gpu-driver-container
The NVIDIA GPU driver container allows the provisioning of the NVIDIA driver through the use of containers.
☆121 Updated last week
NVIDIA / libnvidia-container
NVIDIA container runtime library
☆991 Updated last month
NVIDIA / gds-nvidia-fs
NVIDIA GPUDirect Storage Driver
☆272 Updated 3 months ago
pokerfaceSad / GPUMounter
A kubernetes plugin which enables dynamically add or remove GPU resources for a running Pod
☆127 Updated 3 years ago