NVIDIA / ais-k8s
Kubernetes Operator, Ansible playbooks, and production scripts for large-scale AIStore deployments on Kubernetes.
☆89 · Updated this week
Alternatives and similar repositories for ais-k8s:
Users who are interested in ais-k8s are comparing it to the repositories listed below:
- Run cloud native workloads on NVIDIA GPUs ☆164 · Updated this week
- Holistic job manager on Kubernetes ☆112 · Updated last year
- MIG Partition Editor for NVIDIA GPUs ☆190 · Updated this week
- An Operator for deployment and maintenance of NVIDIA NIMs and NeMo microservices in a Kubernetes environment. ☆88 · Updated this week
- A top-like tool for monitoring GPUs in a cluster ☆86 · Updated last year
- GPU plugin to the node feature discovery for Kubernetes ☆298 · Updated 9 months ago
- Repository for open inference protocol specification ☆49 · Updated 8 months ago
- JobSet: a k8s native API for distributed ML training and HPC workloads ☆194 · Updated this week
- InstaSlice facilitates the use of Dynamic Resource Allocation (DRA) on Kubernetes clusters for GPU sharing ☆27 · Updated 3 months ago
- Provides for deploying custom ETL containers on AIStore, with subsequent user-defined extraction-transformation-loading in parallel, on t… ☆16 · Updated 2 weeks ago
- Controller for ModelMesh ☆225 · Updated 3 weeks ago
- The NVIDIA GPU driver container allows the provisioning of the NVIDIA driver through the use of containers. ☆99 · Updated this week
- K8s device plugin for GPU sharing ☆100 · Updated last year
- A Slurm cluster for Kubernetes ☆55 · Updated 7 months ago
- Run Slurm in Kubernetes ☆185 · Updated this week
- This project provides a framework that runs Slurm in Kubernetes. ☆64 · Updated 2 weeks ago
- NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs ☆480 · Updated last month
- ☆23 · Updated last month
- AWS virtual GPU device plugin provides capability to use smaller virtual GPUs for your machine learning inference workloads ☆204 · Updated last year
- Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.) ☆469 · Updated last week
- Dynamic Resource Allocation (DRA) for NVIDIA GPUs in Kubernetes ☆330 · Updated this week
- NVIDIA NCCL Tests for Distributed Training ☆84 · Updated last week
- markdown docs ☆82 · Updated this week
- CUDA checkpoint and restore utility ☆306 · Updated last month
- Triton Model Navigator is an inference toolkit designed for optimizing and deploying Deep Learning models with a focus on NVIDIA GPUs. ☆198 · Updated 2 months ago
- NVIDIA Network Operator ☆243 · Updated this week
- Enabling Kubernetes to make pod placement decisions with platform intelligence. ☆174 · Updated last month
- A Topology-Aware Custom Scheduler For Kubernetes ☆63 · Updated last year
- Distributed Model Serving Framework ☆159 · Updated last week
- Fork of NVIDIA device plugin for Kubernetes with support for shared GPUs by declaring GPUs multiple times ☆88 · Updated 2 years ago