Azure / azurehpc-health-checks
Health checks for Azure N- and H-series VMs.
☆57 · Feb 5, 2026 · Updated last week
Alternatives and similar repositories for azurehpc-health-checks
Users who are interested in azurehpc-health-checks are comparing it to the repositories listed below.
- Azure HPC/AI VM Images ☆126 · Updated this week
- LBNL Node Health Check ☆269 · Apr 18, 2025 · Updated 10 months ago
- The Azure HPC On-Demand Platform provides an HPC Cluster Ready solution ☆67 · Oct 31, 2025 · Updated 3 months ago
- Prometheus collector and exporter for Slurm cluster metrics. A Slinky project. ☆15 · Nov 7, 2025 · Updated 3 months ago
- A collection of useful Go libraries to ease the development of NVIDIA Operators for GPU/NIC management. ☆29 · Updated this week
- Container startup benchmark tool ☆12 · Apr 10, 2023 · Updated 2 years ago
- A project to apply a traditional implementation of Slurm on Kubernetes (with some magic) ☆11 · Dec 20, 2017 · Updated 8 years ago
- The Job Performance (SUPReMM) module for Open XDMoD. ☆11 · Jan 30, 2026 · Updated 2 weeks ago
- 🧯 Kubernetes coverage for fault awareness and recovery, works for any LLMOps, MLOps, AI workloads. ☆35 · Updated this week
- Distributed AI/HPC Monitoring Framework ☆29 · Apr 11, 2025 · Updated 10 months ago
- Cloud Native Benchmarking of Foundation Models ☆45 · Jul 31, 2025 · Updated 6 months ago
- ATLAHS: An Application-centric Network Simulator Toolchain for AI, HPC, and Distributed Storage ☆70 · Feb 6, 2026 · Updated last week
- ☆323 · Aug 20, 2024 · Updated last year
- GPU Admin Tools. Includes Confidential Computing controls for H100, and other functionality ☆65 · Dec 2, 2025 · Updated 2 months ago
- A toolkit for discovering cluster network topology. ☆98 · Updated this week
- ☆15 · Nov 25, 2021 · Updated 4 years ago
- scalable data movement in Exascale Supercomputers ☆17 · Dec 4, 2025 · Updated 2 months ago
- SMT-LIB benchmarks for shape computations from deep learning models in PyTorch ☆18 · Dec 21, 2022 · Updated 3 years ago
- OpenAPI Golang client library for Slurm REST API. A Slinky project. ☆21 · Updated this week
- Monitoring and visualization of InfiniBand Fabrics ☆23 · Apr 19, 2021 · Updated 4 years ago
- NVIDIA NCCL Tests for Distributed Training ☆136 · Jan 27, 2026 · Updated 3 weeks ago
- A tool for bandwidth measurements on NVIDIA GPUs. ☆623 · Apr 15, 2025 · Updated 10 months ago
- MPI Benchmark on AWS HPC cluster ☆20 · Jan 31, 2020 · Updated 6 years ago
- API for coordinating Maintenance in Kubernetes. ☆26 · Jul 18, 2025 · Updated 6 months ago
- GenAI inference performance benchmarking tool ☆145 · Feb 6, 2026 · Updated last week
- Using C++ magic to capture CUDA kernels and tune them with Kernel Tuner ☆21 · Sep 12, 2025 · Updated 5 months ago
- Overcoming the IOTLB Wall for Multi-100-Gbps Linux-based Networking ☆24 · May 16, 2023 · Updated 2 years ago
- Multi-network CRD specification ☆52 · Apr 11, 2024 · Updated last year
- knode uses a kubernetes daemonset for node configuration. ☆20 · Oct 28, 2020 · Updated 5 years ago
- NCCL Profiling Kit ☆152 · Jul 1, 2024 · Updated last year
- llm-d benchmark scripts and tooling ☆47 · Updated this week
- A Kubernetes Operator to manage Node OS customizations. ☆40 · Feb 11, 2026 · Updated last week
- Command-line tool to retrieve information and monitor Mellanox un-managed Infiniband switches ☆74 · Nov 17, 2025 · Updated 3 months ago
- ☆36 · Sep 1, 2025 · Updated 5 months ago
- GPU Stress Test is a tool to stress the compute engine of NVIDIA Tesla GPUs by running a BLAS matrix multiply using different data types… ☆119 · Jul 8, 2025 · Updated 7 months ago
- Bare Metal Provisioning system for HPC Linux clusters ☆68 · Feb 4, 2026 · Updated last week
- Artifacts of EVT ASPLOS'24 ☆29 · Mar 6, 2024 · Updated last year
- Prometheus exporter for an Infiniband Fabric ☆69 · Dec 12, 2023 · Updated 2 years ago
- Asynchronous Rust bindings for UCX ☆78 · Apr 29, 2025 · Updated 9 months ago