NVSentinel is a cross-platform fault remediation service designed to rapidly remediate runtime node-level issues in GPU-accelerated computing environments.
☆191Feb 26, 2026Updated last week
Alternatives and similar repositories for NVSentinel
Users who are interested in NVSentinel are comparing it to the repositories listed below.
- A toolkit for discovering cluster network topology.☆101Updated this week
- knavigator is a development, testing, and optimization toolkit for AI/ML scheduling systems at scale on Kubernetes.☆74Jul 18, 2025Updated 7 months ago
- ☆10Dec 10, 2024Updated last year
- Validation Generation for Kubeflow CRD on Kubernetes☆11Jan 25, 2021Updated 5 years ago
- ☆16Jul 18, 2025Updated 7 months ago
- Incubating P/D sidecar for llm-d☆16Nov 13, 2025Updated 3 months ago
- 🧯 Kubernetes coverage for fault awareness and recovery, works for any LLMOps, MLOps, AI workloads.☆35Updated this week
- LeaderWorkerSet: An API for deploying a group of pods as a unit of replication☆673Feb 26, 2026Updated last week
- Schedule of developer events in China (focus: open source, developers, cloud native)☆23Feb 25, 2026Updated last week
- ☆20Feb 19, 2026Updated 2 weeks ago
- Triton backend for managing the model state tensors automatically in sequence batcher☆17Feb 12, 2024Updated 2 years ago
- Open Model Engine (OME) — Kubernetes operator for LLM serving, GPU scheduling, and model lifecycle management. Works with SGLang, vLLM, T…☆384Updated this week
- ☆11May 22, 2017Updated 8 years ago
- A lightweight vLLM simulator for mocking out replicas.☆87Updated this week
- Smart Kubernetes Scheduling☆83Feb 27, 2026Updated last week
- Kubernetes-native AI serving platform for scalable model serving.☆233Feb 28, 2026Updated last week
- Example DRA driver that developers can fork and modify to get them started writing their own.☆122Feb 23, 2026Updated last week
- GenAI inference performance benchmarking tool☆151Feb 27, 2026Updated last week
- An Envoy-inspired, ultimate LLM-first gateway for LLM serving and downstream application developers and enterprises☆26Apr 24, 2025Updated 10 months ago
- A workload for deploying LLM inference services on Kubernetes☆179Updated this week
- Golang library for managing resctrl filesystem☆49Updated this week
- A Model Context Protocol (MCP) server that enables AI assistants to interact with Kubernetes clusters. It serves as a bridge between AI t…☆51Feb 26, 2026Updated last week
- Slides, videos, and supporting files for my public talks☆35Feb 27, 2026Updated last week
- Node Resource Interface☆366Feb 27, 2026Updated last week
- WG Serving☆34Dec 15, 2025Updated 2 months ago
- Run Slurm in Kubernetes☆362Updated this week
- NVIDIA DRA Driver for GPUs☆579Updated this week
- Run Slurm as a Kubernetes scheduler. A Slinky project.☆66Feb 24, 2026Updated last week
- ☆38Oct 16, 2025Updated 4 months ago
- ☆47Dec 8, 2025Updated 2 months ago
- ☆33Updated this week
- Cloud Native Benchmarking of Foundation Models☆45Jul 31, 2025Updated 7 months ago
- A distributed system for Agentic AI☆47Updated this week
- ☆11Sep 21, 2022Updated 3 years ago
- An eBPF kernel observability agent for spotting performance issues in the OS.☆13Oct 31, 2025Updated 4 months ago
- This repository contains a Kubernetes controller that manages node taints based on multiple readiness conditions, providing fine-grained …☆110Updated this week
- A QA system based on k8s-specific knowledge, built on ChatGLM2-6B and served by Ray.☆10Sep 14, 2023Updated 2 years ago
- ☆11Aug 27, 2019Updated 6 years ago
- QueueIT Cloudfront Connector (Known User Implementation v.3.x for Cloudfront)☆10Jul 11, 2025Updated 7 months ago