oracle-quickstart / oci-hpc-oke
This repo includes everything you need to know about deploying GPU nodes on OCI
☆40 Updated last week
Alternatives and similar repositories for oci-hpc-oke
Users who are interested in oci-hpc-oke are comparing it to the repositories listed below
- MIG Partition Editor for NVIDIA GPUs ☆232 Updated last week
- NVIDIA NCCL Tests for Distributed Training ☆129 Updated this week
- A toolkit for discovering cluster network topology. ☆86 Updated last week
- An Operator for deployment and maintenance of NVIDIA NIMs and NeMo microservices in a Kubernetes environment. ☆141 Updated this week
- CUDA checkpoint and restore utility ☆397 Updated 3 months ago
- A tool to detect infrastructure issues on cloud native AI systems ☆52 Updated 3 months ago
- JobSet: a k8s native API for distributed ML training and HPC workloads ☆289 Updated this week
- Run cloud native workloads on NVIDIA GPUs ☆210 Updated 2 months ago
- Inference scheduler for llm-d ☆110 Updated this week
- GenAI inference performance benchmarking tool ☆137 Updated this week
- OME is a Kubernetes operator for enterprise-grade management and serving of Large Language Models (LLMs) ☆334 Updated last week
- Kubernetes enhancements for Network Topology Aware Gang Scheduling & Autoscaling ☆126 Updated last week
- ☆173 Updated last week
- Run Slurm in Kubernetes ☆335 Updated this week
- NVIDIA DRA Driver for GPUs ☆508 Updated last week
- Run Slurm on Kubernetes. A Slinky project. ☆208 Updated this week
- Golang bindings for Nvidia Datacenter GPU Manager (DCGM) ☆143 Updated 2 weeks ago
- Health checks for Azure N- and H-series VMs. ☆55 Updated 2 weeks ago
- llm-d helm charts and deployment examples ☆48 Updated last week
- GPUd automates monitoring, diagnostics, and issue identification for GPUs ☆464 Updated this week
- NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs ☆628 Updated 2 weeks ago
- knavigator is a development, testing, and optimization toolkit for AI/ML scheduling systems at scale on Kubernetes. ☆72 Updated 5 months ago
- The NVIDIA GPU driver container allows the provisioning of the NVIDIA driver through the use of containers. ☆148 Updated this week
- NVSentinel is a cross-platform fault remediation service designed to rapidly remediate runtime node-level issues in GPU-accelerated compu… ☆121 Updated this week
- AWS virtual gpu device plugin provides capability to use smaller virtual gpus for your machine learning inference workloads ☆205 Updated 2 years ago
- KAI Scheduler is an open source Kubernetes Native scheduler for AI workloads at large scale ☆1,016 Updated this week
- A Slurm cluster for Kubernetes ☆66 Updated last year
- Gateway API Inference Extension ☆548 Updated this week
- Go Abstraction for Allocating NVIDIA GPUs with Custom Policies ☆120 Updated last week
- Distributed KV cache coordinator ☆92 Updated this week