Tools for building GPU clusters
☆1,420 · Feb 20, 2026 · Updated last week
Alternatives and similar repositories for deepops
Users who are interested in deepops are comparing it to the repositories listed below.
- Container plugin for Slurm Workload Manager · ☆416 · Feb 18, 2026 · Updated last week
- NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes · ☆2,549 · Updated this week
- NVIDIA device plugin for Kubernetes · ☆3,671 · Updated this week
- A simple yet powerful tool to turn traditional container/OS images into unprivileged sandboxes. · ☆901 · Feb 18, 2026 · Updated last week
- Tools for monitoring NVIDIA GPUs on Linux · ☆1,068 · Nov 2, 2021 · Updated 4 years ago
- NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs · ☆671 · Feb 17, 2026 · Updated last week
- The Triton Inference Server provides an optimized cloud and edge inferencing solution. · ☆10,375 · Feb 21, 2026 · Updated last week
- Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.) · ☆515 · Feb 17, 2026 · Updated last week
- Open source web interface for Slurm HPC & AI clusters · ☆545 · Feb 13, 2026 · Updated 2 weeks ago
- Prometheus exporter for performance metrics from Slurm. · ☆275 · Jun 20, 2024 · Updated last year
- Slurm: A Highly Scalable Workload Manager · ☆3,750 · Feb 20, 2026 · Updated last week
- An open-source toolkit for deploying and managing high performance clusters for HPC, AI, and data analytics workloads. · ☆290 · Updated this week
- LBNL Node Health Check · ☆271 · Apr 18, 2025 · Updated 10 months ago
- AIStore: scalable storage for AI applications · ☆1,766 · Updated this week
- Scheduling GPU cluster workloads with Slurm · ☆78 · Nov 5, 2018 · Updated 7 years ago
- My tools for the Slurm HPC workload manager · ☆569 · Updated this week
- GPU Sharing Scheduler for Kubernetes Cluster · ☆1,528 · Dec 29, 2023 · Updated 2 years ago
- HPC Container Maker · ☆507 · Feb 19, 2026 · Updated last week
- Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. · ☆14,675 · Dec 1, 2025 · Updated 2 months ago
- Singularity implementation of k8s operator for interacting with SLURM. · ☆117 · Dec 29, 2020 · Updated 5 years ago
- Ansible role for installing and managing the Slurm Workload Manager · ☆113 · Nov 24, 2025 · Updated 3 months ago
- Resource scheduling and cluster management for AI · ☆2,687 · Jun 6, 2024 · Updated last year
- NVIDIA Network Operator · ☆325 · Updated this week
- MIG Partition Editor for NVIDIA GPUs · ☆241 · Updated this week
- GPU plugin to the node feature discovery for Kubernetes · ☆307 · May 27, 2024 · Updated last year
- Standardized Distributed Generative and Predictive AI Inference Platform for Scalable, Multi-Framework Deployment on Kubernetes · ☆5,135 · Updated this week
- Tools to deploy GPU clusters in the Cloud · ☆34 · Apr 4, 2023 · Updated 2 years ago
- A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep lear… · ☆5,634 · Feb 19, 2026 · Updated last week
- Machine Learning Toolkit for Kubernetes · ☆15,462 · Jan 5, 2026 · Updated last month
- Steps to create a small slurm cluster with GPU enabled nodes · ☆271 · Feb 2, 2023 · Updated 3 years ago
- ☆74 · Updated this week
- NVIDIA container runtime library · ☆1,072 · Updated this week
- Build and run Docker containers leveraging NVIDIA GPUs · ☆17,498 · Dec 6, 2023 · Updated 2 years ago
- A PyTorch Extension: Tools for easy mixed precision and distributed training in PyTorch · ☆8,926 · Feb 16, 2026 · Updated last week
- NVIDIA container runtime · ☆1,124 · Oct 27, 2023 · Updated 2 years ago
- Ansible role for OpenHPC · ☆51 · Updated this week
- Optimized primitives for collective multi-GPU communication · ☆4,474 · Updated this week
- Kubernetes Scheduler for Deep Learning · ☆264 · May 22, 2022 · Updated 3 years ago
- Kubernetes Operator, ansible playbooks, and production scripts for large-scale AIStore deployments on Kubernetes. · ☆129 · Updated this week