lopentusska / slurm_ubuntu_gpu_cluster
Instructions for setting up a Slurm GPU cluster on Ubuntu 22.04.
☆25 Updated last year
Alternatives and similar repositories for slurm_ubuntu_gpu_cluster:
Users who are interested in slurm_ubuntu_gpu_cluster are comparing it to the repositories listed below
- Instructions for setting up a SLURM cluster using Ubuntu 18.04.3 with GPUs. ☆150 Updated 4 years ago
- NVIDIA NCCL Tests for Distributed Training ☆88 Updated this week
- Prometheus exporter for performance metrics from Slurm. ☆249 Updated 10 months ago
- Open source web interface for Slurm HPC & AI clusters ☆408 Updated this week
- Container plugin for Slurm Workload Manager ☆336 Updated 5 months ago
- MIG Partition Editor for NVIDIA GPUs ☆196 Updated last week
- Tools for building GPU clusters ☆1,338 Updated 2 weeks ago
- My tools for the Slurm HPC workload manager ☆498 Updated this week
- A Slurm cluster using docker-compose ☆367 Updated 6 months ago
- NCCL Tests ☆1,081 Updated this week
- Slurm in Docker - Exploring Slurm using CentOS 7 based Docker images ☆129 Updated 5 years ago
- Ansible role for installing and managing the Slurm Workload Manager ☆101 Updated 2 weeks ago
- A dummy's guide to setting up (and using) HPC clusters on Ubuntu 22.04 LTS using Slurm and Munge. Created by the Quant Club @ UIowa. ☆302 Updated last year
- Slurm-Mail is a drop-in replacement for Slurm's e-mails to give users much more information about their jobs compared to the standard Slu… ☆105 Updated this week
- An open-source toolkit for deploying and managing high performance clusters for HPC, AI, and data analytics workloads. ☆242 Updated this week
- Export select Slurm metrics to Prometheus ☆49 Updated last month
- Steps to create a small Slurm cluster with GPU-enabled nodes ☆270 Updated 2 years ago
- Jobstats is a job monitoring platform for CPU and GPU clusters ☆70 Updated last week
- PyTorch code examples for measuring the performance of collective communication calls in AI workloads ☆16 Updated 5 months ago
- Metastack: an enhanced and performance optimized version of Slurm ☆51 Updated this week
- Pretrain, finetune and serve LLMs on Intel platforms with Ray ☆126 Updated 3 weeks ago
- FlagScale is a large model toolkit based on open-sourced projects. ☆268 Updated this week
- KvikIO - High Performance File IO ☆206 Updated this week
- Tutorial for installing Open XDMoD, OnDemand, & ColdFront ☆139 Updated last month
- Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.) ☆474 Updated last week
- Python Interface to Slurm ☆511 Updated 2 months ago
- NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs ☆494 Updated 2 months ago
- Supercomputing. Seamlessly. Open, Interactive HPC Via the Web ☆333 Updated this week
- ☆303 Updated 8 months ago
- MSCCL++: A GPU-driven communication stack for scalable AI applications ☆342 Updated this week