oracle-quickstart / oci-hpc-oke
This repo includes everything you need to know about deploying GPU nodes on OCI
☆35 Updated last week
Alternatives and similar repositories for oci-hpc-oke
Users who are interested in oci-hpc-oke are comparing it to the repositories listed below.
- An Operator for deployment and maintenance of NVIDIA NIMs and NeMo microservices in a Kubernetes environment. ☆129 Updated this week
- MIG Partition Editor for NVIDIA GPUs ☆213 Updated 2 weeks ago
- A toolkit for discovering cluster network topology. ☆69 Updated this week
- NVIDIA DRA Driver for GPUs ☆446 Updated this week
- NVIDIA NCCL Tests for Distributed Training ☆111 Updated last week
- Kubernetes enhancements for Network Topology Aware Gang Scheduling & Autoscaling ☆59 Updated this week
- JobSet: a k8s native API for distributed ML training and HPC workloads ☆262 Updated this week
- Run Slurm on Kubernetes. A Slinky project. ☆169 Updated this week
- Inference scheduler for llm-d ☆94 Updated last week
- KAI Scheduler is an open source Kubernetes Native scheduler for AI workloads at large scale ☆815 Updated last week
- Run cloud native workloads on NVIDIA GPUs ☆197 Updated last month
- GenAI inference performance benchmarking tool ☆97 Updated this week
- Golang bindings for Nvidia Datacenter GPU Manager (DCGM) ☆133 Updated last month
- Run Slurm in Kubernetes ☆284 Updated last week
- CUDA checkpoint and restore utility ☆371 Updated last week
- Controller for ModelMesh ☆237 Updated 3 months ago
- GPU plugin to the node feature discovery for Kubernetes ☆305 Updated last year
- OME is a Kubernetes operator for enterprise-grade management and serving of Large Language Models (LLMs) ☆273 Updated last week
- The NVIDIA GPU driver container allows the provisioning of the NVIDIA driver through the use of containers. ☆132 Updated this week
- Distributed KV cache coordinator ☆71 Updated this week
- Gateway API Inference Extension ☆486 Updated this week
- Go Abstraction for Allocating NVIDIA GPUs with Custom Policies ☆115 Updated 3 weeks ago
- LeaderWorkerSet: An API for deploying a group of pods as a unit of replication ☆583 Updated this week
- Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.) ☆496 Updated this week
- Holistic job manager on Kubernetes ☆116 Updated last year
- ☆254 Updated last week
- ☆144 Updated last week
- ☆263 Updated 3 weeks ago
- Model Registry provides a single pane of glass for ML model developers to index and manage models, versions, and ML artifacts metadata. I… ☆150 Updated this week
- A Slurm cluster for Kubernetes ☆63 Updated last year