nebius / soperator
Run Slurm in Kubernetes
☆193 Updated this week
Alternatives and similar repositories for soperator:
Users who are interested in soperator are comparing it to the repositories listed below.
- JobSet: a k8s native API for distributed ML training and HPC workloads ☆194 Updated this week
- This project provides a framework that runs Slurm in Kubernetes. ☆65 Updated this week
- A Slurm cluster for Kubernetes ☆55 Updated 7 months ago
- Kubernetes Operator, ansible playbooks, and production scripts for large-scale AIStore deployments on Kubernetes. ☆89 Updated this week
- ☆34 Updated this week
- An Operator for deployment and maintenance of NVIDIA NIMs and NeMo microservices in a Kubernetes environment. ☆88 Updated this week
- Dynamic Resource Allocation (DRA) for NVIDIA GPUs in Kubernetes ☆330 Updated last week
- MIG Partition Editor for NVIDIA GPUs ☆190 Updated last week
- CUDA checkpoint and restore utility ☆310 Updated last month
- K8s device plugin for GPU sharing ☆100 Updated last year
- Slurm in Kubernetes ☆42 Updated 3 months ago
- LeaderWorkerSet: An API for deploying a group of pods as a unit of replication ☆339 Updated this week
- ☆112 Updated last week
- This repo includes everything you need to know about deploying GPU nodes on OCI ☆25 Updated this week
- elastic-gpu-scheduler is a Kubernetes scheduler extender for GPU resources scheduling. ☆140 Updated 2 years ago
- knavigator is a development, testing, and optimization toolkit for AI/ML scheduling systems at scale on Kubernetes. ☆63 Updated last week
- ☆23 Updated last month
- NVIDIA NCCL Tests for Distributed Training ☆85 Updated last week
- ☆94 Updated 2 months ago
- GPU plugin to the node feature discovery for Kubernetes ☆298 Updated 9 months ago
- ☸️ Easy, advanced inference platform for large language models on Kubernetes. 🌟 Star to support our work! ☆99 Updated this week
- Holistic job manager on Kubernetes ☆112 Updated last year
- GPUd automates monitoring, diagnostics, and issue identification for GPUs ☆286 Updated this week
- Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.) ☆469 Updated last week
- NVIDIA device plugin for Kubernetes ☆47 Updated last year
- A toolkit for discovering cluster network topology. ☆39 Updated this week
- ☆248 Updated this week
- Module, Model, and Tensor Serialization/Deserialization ☆220 Updated last month
- Practical GPU Sharing Without Memory Size Constraints ☆258 Updated 6 months ago
- Device plugins for Volcano, e.g. GPU ☆116 Updated 6 months ago