scitix / sichek
Sichek is a tool for detecting and diagnosing node-level issues in AI environments, ensuring the reliability and high performance of GPU-intensive workloads. It proactively identifies hardware and software problems and triggers timely automated corrective actions, including task retries and operational maintenance.
☆12 Updated 2 months ago
Alternatives and similar repositories for sichek
Users who are interested in sichek are comparing it to the libraries listed below.
- ☆299 Updated this week
- Kubernetes Operator for AI and Big Data Elastic Training ☆88 Updated 9 months ago
- ☆535 Updated last year
- HAMi-core compiles libvgpu.so, which enforces a hard limit on GPU resources in containers ☆221 Updated 3 weeks ago
- Kubernetes RDMA SR-IOV device plugin ☆111 Updated 4 years ago
- ☆132 Updated 4 years ago
- RDMA device plugin for Kubernetes ☆219 Updated last year
- A Kubernetes plugin that enables dynamically adding or removing GPU resources for a running Pod ☆127 Updated 3 years ago
- Device plugins for Volcano, e.g. GPU ☆129 Updated 6 months ago
- Run your deep learning workloads on Kubernetes more easily and efficiently. ☆531 Updated last year
- Arks is a cloud-native inference framework running on Kubernetes ☆43 Updated last week
- Kubernetes Scheduler for Deep Learning ☆262 Updated 3 years ago
- InfiniBand Verbs Performance Tests ☆826 Updated 3 weeks ago
- ☆54 Updated 2 weeks ago
- Using CRDs to manage GPU resources in Kubernetes. ☆209 Updated 2 years ago
- MIG Partition Editor for NVIDIA GPUs ☆215 Updated last week
- ☆884 Updated last year
- GPU Sharing Device Plugin for Kubernetes Cluster ☆488 Updated 2 years ago
- Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.) ☆496 Updated 2 weeks ago
- A federation scheduler for multi-cluster ☆54 Updated 3 months ago
- NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs ☆594 Updated last month
- Set of Kubernetes solutions for reusing idle resources of nodes by running extra batch jobs ☆355 Updated 3 months ago
- Common APIs and libraries shared by other Kubeflow operator repositories. ☆53 Updated 2 years ago
- ☆121 Updated 2 years ago
- Go Abstraction for Allocating NVIDIA GPUs with Custom Policies ☆115 Updated last week
- NVIDIA Network Operator ☆284 Updated this week
- ☆22 Updated 6 months ago
- Hooked CUDA-related dynamic libraries by using automated code generation tools. ☆167 Updated last year
- NVIDIA NCCL Tests for Distributed Training ☆112 Updated last week
- Yoda is a Kubernetes scheduler based on GPU metrics. ☆138 Updated 3 years ago