Azure / azurehpc-health-checks
Health checks for Azure N- and H-series VMs.
☆44Updated last month
Alternatives and similar repositories for azurehpc-health-checks
Users who are interested in azurehpc-health-checks are comparing it to the repositories listed below
- NVIDIA NCCL Tests for Distributed Training☆93Updated last week
- A tool to detect infrastructure issues on cloud native AI systems☆37Updated last month
- Azure HPC/AI VM Images☆110Updated this week
- MIG Partition Editor for NVIDIA GPUs☆201Updated this week
- Cloud Native Benchmarking of Foundation Models☆36Updated last week
- ☆267Updated last week
- knavigator is a development, testing, and optimization toolkit for AI/ML scheduling systems at scale on Kubernetes.☆67Updated last month
- NVIDIA Network Operator☆254Updated last week
- RDMA CNI plugin for containerized workloads☆53Updated this week
- ☆43Updated last year
- ☆62Updated 2 weeks ago
- Kubernetes RDMA SR-IOV device plugin☆111Updated 4 years ago
- Intelligent platform for AI workloads☆37Updated 2 years ago
- GenAI inference performance benchmarking tool☆55Updated 2 weeks ago
- An efficient GPU resource sharing system with fine-grained control for Linux platforms.☆83Updated last year
- Kubernetes Container Runtime Interface proxy service with hardware resource aware workload placement policies☆179Updated 2 months ago
- A toolkit for discovering cluster network topology.☆54Updated last week
- A lightweight vLLM simulator, for mocking out replicas.☆24Updated last week
- 🧯 Kubernetes coverage for fault awareness and recovery, works for any LLMOps, MLOps, AI workloads.☆30Updated 5 months ago
- Lightweight daemon for monitoring CUDA runtime API calls with eBPF uprobes☆98Updated 2 months ago
- An I/O benchmark for deep learning applications☆87Updated 3 weeks ago
- DPDK & SR-IOV CNI plugin☆19Updated this week
- A command line utility to manage the configuration of a system's high performance network interfaces for RoCE deployments☆28Updated last year
- NCCL Profiling Kit☆137Updated 11 months ago
- Enabling Kubernetes to make pod placement decisions with platform intelligence.☆175Updated 4 months ago
- nvloom is a set of tools designed to scalably test MNNVL fabrics.☆18Updated 3 weeks ago
- A federation scheduler for multi-cluster☆44Updated 3 weeks ago
- An Operator for deployment and maintenance of NVIDIA NIMs and NeMo microservices in a Kubernetes environment.☆114Updated this week
- A Lustre container storage interface that allows Kubernetes to mount/unmount provisioned Lustre filesystems into containers.☆34Updated last month
- Holistic job manager on Kubernetes☆116Updated last year