Azure/azurehpc-health-checks

Health checks for Azure N- and H-series VMs.

☆59

Alternatives and similar repositories for azurehpc-health-checks

Users who are interested in azurehpc-health-checks are comparing it to the repositories listed below.

mej / nhc
LBNL Node Health Check
☆284 · Apr 7, 2026 · Updated 3 months ago
Azure / Moneo
Distributed AI/HPC Monitoring Framework
☆29 · Apr 11, 2025 · Updated last year
nydusaccelerator / hello-bench
Container startup benchmark tool
☆12 · Apr 10, 2023 · Updated 3 years ago
Azure / azlustre
Azure ARM Template for Lustre filesystem deployment
☆11 · Mar 13, 2023 · Updated 3 years ago
microsoft / hermes-ndv2
Messaging library on top of NDv2 (Microsoft's RDMA interface)
☆13 · Jun 12, 2023 · Updated 3 years ago
xjdr-alt / mla_blog_translation
☆13 · Jun 18, 2024 · Updated 2 years ago
imbue-ai / cluster-health
Scripts for managing a large H100 cluster and fixing hardware issues to ensure smooth model training.
☆323 · Aug 20, 2024 · Updated last year
jeefy / slurmnetes
A project to apply a traditional implementation of Slurm on Kubernetes (with some magic)
☆11 · Dec 20, 2017 · Updated 8 years ago
NVIDIA / k8s-operator-libs
A collection of useful Go libraries to ease the development of NVIDIA Operators for GPU/NIC management.
☆30 · Updated this week
GoogleCloudPlatform / cluster-health-scanner
☆36 · Oct 31, 2025 · Updated 8 months ago
opencomputeproject / OCP-Multipath-Reliable-Connection
Multipath Reliable Connection (MRC) extends InfiniBand Reliable Connection semantics so a single RDMA connection can spray traffic across…
☆21 · Jun 8, 2026 · Updated last month
NVIDIA / nvbandwidth
A tool for bandwidth measurements on NVIDIA GPUs.
☆735 · Updated this week
edwardsp / lemur
Lustre HSM tools
☆10 · Feb 19, 2024 · Updated 2 years ago
NVIDIA / topograph
A toolkit for discovering cluster network topology.
☆145 · Updated this week
adrianwyatt / dalle-picture-frame
☆12 · Apr 17, 2023 · Updated 3 years ago
Azure / aks-rdma-infiniband
⚡ Guidance, samples, and tools for HPC workloads on AKS clusters with RDMA and InfiniBand support, including GPUDirect RDMA.
☆23 · Updated this week
Azure / azure-batch-cli-extensions
Batch extension CLI commands for Azure CLI v2
☆13 · Mar 6, 2024 · Updated 2 years ago
NVIDIA / pika
API for coordinating Maintenance in Kubernetes.
☆26 · Jul 2, 2026 · Updated 2 weeks ago
FindHao / drgpu
A Top-Down Profiler for GPU Applications
☆23 · Feb 29, 2024 · Updated 2 years ago
juan-lee / knode
knode uses a Kubernetes DaemonSet for node configuration.
☆20 · Oct 28, 2020 · Updated 5 years ago
guilbaults / infiniband-exporter
Prometheus exporter for an InfiniBand Fabric
☆70 · Dec 12, 2023 · Updated 2 years ago
kubeflow / crd-validation
Validation Generation for Kubeflow CRD on Kubernetes
☆11 · Jan 25, 2021 · Updated 5 years ago
Azure / cyclecloud-slurm
Azure CycleCloud project to enable users to create, configure, and use Slurm HPC clusters.
☆84 · Updated this week
ubccr / xdmod-supremm
The Job Performance (SUPReMM) module for Open XDMoD.
☆12 · Apr 2, 2026 · Updated 3 months ago
MemVerge / pmts
Persistent Memory Test Suite
☆14 · Apr 29, 2020 · Updated 6 years ago
NVIDIA / dgxc-benchmarking
DGXC Benchmarking provides recipes in ready-to-use templates for evaluating performance of specific AI use cases across hardware and soft…
☆98 · Jul 6, 2026 · Updated 2 weeks ago
flavio / krew-wasm
krew-wasm offers a way to write and distribute kubectl plugins based on WebAssembly
☆14 · Apr 15, 2024 · Updated 2 years ago
llm-d-incubation / llm-d-planner
☆25 · Updated this week
u-root / iscsinl
Go iSCSI initiator netlink library
☆15 · Feb 25, 2023 · Updated 3 years ago
Azure / azurehpc
This repository provides easy automation scripts for building a HPC environment in Azure. It also includes examples to build e2e environm…
☆133 · Oct 31, 2024 · Updated last year
k8snetworkplumbingwg / ib-sriov-cni
InfiniBand SR-IOV CNI
☆58 · Updated this week
infiniband-radar / infiniband-radar-daemon
☆15 · Nov 25, 2021 · Updated 4 years ago
Azure / azure-hpc
Microsoft Azure HPC & Big Compute
☆15 · Aug 18, 2019 · Updated 6 years ago
risingwavelabs / memcomparable
A memcomparable serialization format.
☆23 · May 16, 2023 · Updated 3 years ago
coreweave / nccl-tests
NVIDIA NCCL Tests for Distributed Training
☆149 · Jul 8, 2026 · Updated last week
infiniband-radar / infiniband-radar-web
Monitoring and visualization of InfiniBand Fabrics
☆23 · Apr 19, 2021 · Updated 5 years ago
Mellanox / k8s-rdma-shared-dev-plugin
☆375 · Updated this week
NVIDIA / Fabric-Manager-Client
This is a tool for managing GPU partitions for NVIDIA Fabric Manager’s Shared NVSwitch.
☆17 · Jul 2, 2026 · Updated 2 weeks ago
NVIDIA / GPUStressTest
GPU Stress Test is a tool to stress the compute engine of NVIDIA Tesla GPUs by running a BLAS matrix multiply using different data types…
☆126 · Jul 8, 2025 · Updated last year