NVIDIA/NVSentinel

NVIDIA / NVSentinel

NVSentinel is a cross-platform fault remediation service designed to rapidly remediate runtime node-level issues in GPU-accelerated computing environments.

☆191

Alternatives and similar repositories for NVSentinel

Users who are interested in NVSentinel are comparing it to the repositories listed below.

NVIDIA / topograph
A toolkit for discovering cluster network topology.
☆101 · Updated this week
NVIDIA / knavigator
knavigator is a development, testing, and optimization toolkit for AI/ML scheduling systems at scale on Kubernetes.
☆74 · Jul 18, 2025 · Updated 7 months ago
nl2logql / LogQLLM
☆10 · Dec 10, 2024 · Updated last year
kubeflow / crd-validation
Validation Generation for Kubeflow CRD on Kubernetes
☆11 · Jan 25, 2021 · Updated 5 years ago
run-ai / kwok-operator
☆16 · Jul 18, 2025 · Updated 7 months ago
llm-d / llm-d-routing-sidecar
Incubating P/D sidecar for llm-d
☆16 · Nov 13, 2025 · Updated 3 months ago
BaizeAI / kcover
🧯 Kubernetes coverage for fault awareness and recovery, works for any LLMOps, MLOps, AI workloads.
☆35 · Updated this week
kubernetes-sigs / lws
LeaderWorkerSet: An API for deploying a group of pods as a unit of replication
☆673 · Feb 26, 2026 · Updated last week
pacoxu / developers-conferences-agenda
China developer event schedule (focus: open source, developers, cloud native)
☆23 · Feb 25, 2026 · Updated last week
Azure / kperf
☆20 · Feb 19, 2026 · Updated 2 weeks ago
triton-inference-server / stateful_backend
Triton backend for managing the model state tensors automatically in sequence batcher
☆17 · Feb 12, 2024 · Updated 2 years ago
sgl-project / ome
Open Model Engine (OME) — Kubernetes operator for LLM serving, GPU scheduling, and model lifecycle management. Works with SGLang, vLLM, T…
☆384 · Updated this week
SeniYuting / BookStore
☆11 · May 22, 2017 · Updated 8 years ago
llm-d / llm-d-inference-sim
A lightweight vLLM simulator, for mocking out replicas.
☆87 · Updated this week
schednex-ai / schednex
Smart Kubernetes Scheduling
☆83 · Feb 27, 2026 · Updated last week
volcano-sh / kthena
Kubernetes-native AI serving platform for scalable model serving.
☆233 · Feb 28, 2026 · Updated last week
kubernetes-sigs / dra-example-driver
Example DRA driver that developers can fork and modify to get them started writing their own.
☆122 · Feb 23, 2026 · Updated last week
kubernetes-sigs / inference-perf
GenAI inference performance benchmarking tool
☆151 · Feb 27, 2026 · Updated last week
knoway-dev / knoway
An Envoy-inspired, ultimate LLM-first gateway for LLM serving and downstream application developers and enterprises
☆26 · Apr 24, 2025 · Updated 10 months ago
sgl-project / rbg
A workload for deploying LLM inference services on Kubernetes
☆179 · Updated this week
intel / goresctrl
Golang library for managing resctrl filesystem
☆49 · Updated this week
Azure / mcp-kubernetes
A Model Context Protocol (MCP) server that enables AI assistants to interact with Kubernetes clusters. It serves as a bridge between AI t…
☆51 · Feb 26, 2026 · Updated last week
terrytangyuan / public-talks
Slides, videos, and supporting files for my public talks
☆35 · Feb 27, 2026 · Updated last week
containerd / nri
Node Resource Interface
☆366 · Feb 27, 2026 · Updated last week
kubernetes-sigs / wg-serving
WG Serving
☆34 · Dec 15, 2025 · Updated 2 months ago
nebius / soperator
Run Slurm in Kubernetes
☆362 · Updated this week
NVIDIA / k8s-dra-driver-gpu
NVIDIA DRA Driver for GPUs
☆579 · Updated this week
SlinkyProject / slurm-bridge
Run Slurm as a Kubernetes scheduler. A Slinky project.
☆66 · Feb 24, 2026 · Updated last week
intel / memtierd
☆38 · Oct 16, 2025 · Updated 4 months ago
OpenCIDN / ocimirror
☆47 · Dec 8, 2025 · Updated 2 months ago
DDNStorage / exa-csi-driver
☆33 · Updated this week
fmperf-project / fmperf
Cloud Native Benchmarking of Foundation Models
☆45 · Jul 31, 2025 · Updated 7 months ago
xflops / flame
A distributed system for Agentic AI
☆47 · Updated this week
argoproj-labs / argoverse
☆11 · Sep 21, 2022 · Updated 3 years ago
chentao-kernel / spycat
An eBPF kernel observability agent for spotting performance issues in the OS.
☆13 · Oct 31, 2025 · Updated 4 months ago
kubernetes-sigs / node-readiness-controller
This repository contains a Kubernetes controller that manages node taints based on multiple readiness conditions, providing fine-grained …
☆110 · Updated this week
kerthcet / k8s-specific-knowledge-base
A QA system based on k8s-specific knowledge, built on ChatGLM2-6B and served by Ray.
☆10 · Sep 14, 2023 · Updated 2 years ago
tohwsw / aws-account-factory
☆11 · Aug 27, 2019 · Updated 6 years ago
queueit / KnownUser.V3.Cloudfront
QueueIT Cloudfront Connector (Known User Implementation v.3.x for Cloudfront)
☆10 · Jul 11, 2025 · Updated 7 months ago