oracle-quickstart / oci-hpc-oke
This repo includes everything you need to know about deploying GPU nodes on OCI
☆29 Updated last week
Alternatives and similar repositories for oci-hpc-oke
Users who are interested in oci-hpc-oke are comparing it to the repositories listed below
- An Operator for deployment and maintenance of NVIDIA NIMs and NeMo microservices in a Kubernetes environment. ☆103 Updated this week
- A Slurm cluster for Kubernetes ☆57 Updated 9 months ago
- NVIDIA NCCL Tests for Distributed Training ☆90 Updated last week
- A toolkit for discovering cluster network topology. ☆46 Updated 2 weeks ago
- MIG Partition Editor for NVIDIA GPUs ☆198 Updated last week
- GenAI inference performance benchmarking tool ☆41 Updated this week
- Health checks for Azure N- and H-series VMs. ☆40 Updated 2 weeks ago
- ☆150 Updated last month
- Run Slurm on Kubernetes. A Slinky project. ☆88 Updated 2 weeks ago
- RDMA CNI plugin for containerized workloads ☆52 Updated last week
- JobSet: a k8s native API for distributed ML training and HPC workloads ☆226 Updated last week
- knavigator is a development, testing, and optimization toolkit for AI/ML scheduling systems at scale on Kubernetes. ☆66 Updated last week
- Kubernetes Operator, ansible playbooks, and production scripts for large-scale AIStore deployments on Kubernetes. ☆96 Updated this week
- Example DRA driver that developers can fork and modify to get them started writing their own. ☆70 Updated this week
- Go Abstraction for Allocating NVIDIA GPUs with Custom Policies ☆113 Updated 9 months ago
- ☆24 Updated last week
- Dynamic Resource Allocation (DRA) for NVIDIA GPUs in Kubernetes ☆355 Updated this week
- ☆62 Updated this week
- IP Over Infiniband (IPoIB) CNI Plugin ☆14 Updated this week
- Golang bindings for Nvidia Datacenter GPU Manager (DCGM) ☆112 Updated last month
- ☆248 Updated last week
- CUDA checkpoint and restore utility ☆333 Updated 3 months ago
- ☆110 Updated last week
- NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs ☆510 Updated 2 weeks ago
- Cloud Native Benchmarking of Foundation Models ☆33 Updated 6 months ago
- Run cloud native workloads on NVIDIA GPUs ☆169 Updated last week
- NVIDIA Network Operator ☆248 Updated this week
- AppWrapper controller for Kueue ☆13 Updated 2 weeks ago
- Repository for open inference protocol specification ☆55 Updated this week
- The NVIDIA GPU driver container allows the provisioning of the NVIDIA driver through the use of containers. ☆111 Updated last week