leptonai / gpud

GPUd automates monitoring, diagnostics, and issue identification for GPUs

☆401

Alternatives and similar repositories for gpud

Users who are interested in gpud are comparing it to the repositories listed below.

sgl-project / ome
OME is a Kubernetes operator for enterprise-grade management and serving of Large Language Models (LLMs)
☆192 Updated last week
coreweave / nccl-tests
NVIDIA NCCL Tests for Distributed Training
☆100 Updated last week
imbue-ai / cluster-health
☆313 Updated 11 months ago
NVIDIA / cuda-checkpoint
CUDA checkpoint and restore utility
☆353 Updated 6 months ago
InftyAI / llmaz
☸️ Easy, advanced inference platform for large language models on Kubernetes. 🌟 Star to support our work!
☆228 Updated this week
pokerfaceSad / GPUMounter
A kubernetes plugin which enables dynamically add or remove GPU resources for a running Pod
☆127 Updated 3 years ago
Project-HAMi / HAMi-core
HAMi-core compiles libvgpu.so, which ensures hard limit on GPU in container
☆191 Updated last week
Mellanox / k8s-rdma-shared-dev-plugin
☆283 Updated last week
NVIDIA / DCGM
NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
☆551 Updated 2 months ago
kubernetes-sigs / lws
LeaderWorkerSet: An API for deploying a group of pods as a unit of replication
☆526 Updated this week
bytedance / InfiniStore
KV cache store for distributed LLM inference
☆303 Updated last month
volcano-sh / devices
Device plugins for Volcano, e.g. GPU
☆126 Updated 4 months ago
NVIDIA / go-dcgm
Golang bindings for Nvidia Datacenter GPU Manager (DCGM)
☆123 Updated this week
elastic-ai / elastic-gpu-scheduler
elastic-gpu-scheduler is a Kubernetes scheduler extender for GPU resources scheduling.
☆142 Updated 2 years ago
ai-dynamo / nixl
NVIDIA Inference Xfer Library (NIXL)
☆491 Updated this week
AliyunContainerService / et-operator
Kubernetes Operator for AI and Bigdata Elastic Training
☆87 Updated 6 months ago
NVIDIA / go-gpuallocator
Go Abstraction for Allocating NVIDIA GPUs with Custom Policies
☆116 Updated last month
NVIDIA / k8s-dra-driver-gpu
NVIDIA DRA Driver for GPUs
☆400 Updated last week
BaizeAI / kcover
🧯 Kubernetes coverage for fault awareness and recovery, works for any LLMOps, MLOps, AI workloads.
☆31 Updated last week
grgalex / nvshare
Practical GPU Sharing Without Memory Size Constraints
☆276 Updated 4 months ago
tkestack / vcuda-controller
☆533 Updated last year
AlibabaPAI / llumnix
Efficient and easy multi-instance LLM serving
☆454 Updated this week
NVIDIA / topograph
A toolkit for discovering cluster network topology.
☆59 Updated last week
kubernetes-sigs / gateway-api-inference-extension
Gateway API Inference Extension
☆415 Updated this week
elastic-ai / elastic-gpu
Using CRDs to manage GPU resources in Kubernetes.
☆206 Updated 2 years ago
NVIDIA / mig-parted
MIG Partition Editor for NVIDIA GPUs
☆204 Updated last week
Project-HAMi / volcano-vgpu-device-plugin
Device-plugin for volcano vgpu which support hard resource isolation
☆96 Updated last month
NVIDIA / gpu-feature-discovery
GPU plugin to the node feature discovery for Kubernetes
☆302 Updated last year
run-ai / runai-model-streamer
☆231 Updated this week
run-ai / fake-gpu-operator
☆130 Updated 2 weeks ago