Azure / azurehpc-health-checks
Health checks for Azure N- and H-series VMs.
☆48 Updated this week
Alternatives and similar repositories for azurehpc-health-checks
Users who are interested in azurehpc-health-checks are comparing it to the repositories listed below.
- NVIDIA NCCL Tests for Distributed Training ☆105 Updated this week
- A tool to detect infrastructure issues on cloud native AI systems ☆45 Updated 3 weeks ago
- Kubernetes Rdma SRIOV device plugin ☆111 Updated 4 years ago
- ☆43 Updated last year
- ☆286 Updated this week
- MIG Partition Editor for NVIDIA GPUs ☆209 Updated this week
- ☆64 Updated last week
- NVIDIA Network Operator ☆270 Updated this week
- A toolkit for discovering cluster network topology. ☆63 Updated last week
- Cloud Native Benchmarking of Foundation Models ☆40 Updated 3 weeks ago
- An efficient GPU resource sharing system with fine-grained control for Linux platforms. ☆84 Updated last year
- RDMA CNI plugin for containerized workloads ☆55 Updated 2 weeks ago
- GenAI inference performance benchmarking tool ☆74 Updated last week
- DPDK & SR-IOV CNI plugin ☆19 Updated last week
- Kubernetes Container Runtime Interface proxy service with hardware resource aware workload placement policies ☆177 Updated last month
- knavigator is a development, testing, and optimization toolkit for AI/ML scheduling systems at scale on Kubernetes. ☆69 Updated last month
- An I/O benchmark for deep Learning applications ☆90 Updated 2 months ago
- 🧯 Kubernetes coverage for fault awareness and recovery, works for any LLMOps, MLOps, AI workloads. ☆31 Updated last week
- CUDA checkpoint and restore utility ☆360 Updated 6 months ago
- Azure HPC/AI VM Images ☆115 Updated this week
- RDMA device plugin for Kubernetes ☆218 Updated last year
- Go Abstraction for Allocating NVIDIA GPUs with Custom Policies ☆116 Updated last week
- The NVIDIA GPU driver container allows the provisioning of the NVIDIA driver through the use of containers. ☆126 Updated last week
- RDMA and SHARP plugins for nccl library ☆200 Updated 2 months ago
- Enabling Kubernetes to make pod placement decisions with platform intelligence. ☆176 Updated 6 months ago
- cricket is a virtualization solution for GPUs ☆213 Updated 2 months ago
- Hooked CUDA-related dynamic libraries by using automated code generation tools. ☆165 Updated last year
- Holistic job manager on Kubernetes ☆116 Updated last year
- Persistent Memory Container Storage Interface Driver ☆163 Updated 10 months ago
- ☆25 Updated 3 weeks ago