scitix / sichek
Sichek is a tool for detecting and diagnosing node-level issues in AI environments, ensuring the reliability and high performance of GPU-intensive workloads. It proactively identifies hardware and software problems and triggers timely automated corrective actions, including task retries and operational maintenance.
☆15 Updated last week
Alternatives and similar repositories for sichek
Users who are interested in sichek are comparing it to the repositories listed below.
- ☆328 Updated last week
- ☆538 Updated last year
- HAMi-core compiles libvgpu.so, which enforces a hard limit on the GPU in a container ☆262 Updated last month
- Arks is a cloud-native inference framework running on Kubernetes ☆45 Updated last month
- ☆72 Updated 2 months ago
- Kubernetes Operator for AI and Bigdata Elastic Training ☆90 Updated last year
- RDMA device plugin for Kubernetes ☆226 Updated 2 years ago
- A Kubernetes plugin that enables dynamically adding or removing GPU resources for a running Pod ☆127 Updated 3 years ago
- ☆132 Updated 4 years ago
- Device plugins for Volcano, e.g. GPU ☆131 Updated 9 months ago
- Kubernetes RDMA SRIOV device plugin ☆112 Updated 5 years ago
- Using CRDs to manage GPU resources in Kubernetes. ☆210 Updated 3 years ago
- ☆890 Updated last year
- Run your deep learning workloads on Kubernetes more easily and efficiently. ☆533 Updated last year
- A development example of a Kubernetes device plugin ☆36 Updated 5 years ago
- The VPC-CNI plugin for Volcengine. ☆101 Updated 5 months ago
- NVIDIA Network Operator ☆315 Updated this week
- NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs ☆642 Updated last month
- A federation scheduler for multi-cluster ☆59 Updated this week
- Pingmesh: A Large-Scale System for Data Center Network Latency Measurement and Analysis ☆155 Updated last year
- ☆59 Updated 2 weeks ago
- Set of Kubernetes solutions for reusing idle resources of nodes by running extra batch jobs ☆355 Updated 6 months ago
- RDMA and SHARP plugins for nccl library ☆218 Updated last month
- OpenAIOS vGPU device plugin for Kubernetes originated from the OpenAIOS project to virtualize GPU device memory, in order to allow app… ☆582 Updated last year
- NVIDIA NCCL Tests for Distributed Training ☆132 Updated this week
- ☆54 Updated 4 months ago
- SRIOV network device plugin for Kubernetes ☆495 Updated 2 weeks ago
- Yoda is a Kubernetes scheduler based on GPU metrics. ☆137 Updated 3 years ago
- Infiniband Verbs Performance Tests ☆896 Updated last week
- Lets underlay and overlay networks coexist, communicate, and even be transformed purposefully. ☆271 Updated last year