Azure / azurehpc-health-checks
Health checks for Azure N- and H-series VMs.
☆54 Updated 2 weeks ago
Alternatives and similar repositories for azurehpc-health-checks
Users who are interested in azurehpc-health-checks are comparing it to the libraries listed below:
- A tool to detect infrastructure issues on cloud native AI systems ☆48 Updated last month
- NVIDIA NCCL Tests for Distributed Training ☆116 Updated this week
- ☆43 Updated last year
- Cloud Native Benchmarking of Foundation Models ☆44 Updated 2 months ago
- ☆25 Updated last week
- A toolkit for discovering cluster network topology. ☆74 Updated this week
- An efficient GPU resource sharing system with fine-grained control for Linux platforms. ☆85 Updated last year
- ☆66 Updated this week
- MIG Partition Editor for NVIDIA GPUs ☆219 Updated this week
- A workload for deploying LLM inference services on Kubernetes ☆87 Updated this week
- NVIDIA Network Operator ☆285 Updated this week
- Kubernetes Container Runtime Interface proxy service with hardware resource aware workload placement policies ☆177 Updated 3 months ago
- Kubernetes Rdma SRIOV device plugin ☆111 Updated 4 years ago
- RDMA CNI plugin for containerized workloads ☆58 Updated 2 weeks ago
- Holistic job manager on Kubernetes ☆116 Updated last year
- ☆303 Updated last week
- GenAI inference performance benchmarking tool ☆106 Updated last week
- llm-d benchmark scripts and tooling ☆30 Updated this week
- CUDA checkpoint and restore utility ☆377 Updated last month
- A command line utility to manage the configuration of a system's high performance network interfaces for RoCE deployments ☆32 Updated 2 years ago
- Lightweight daemon for monitoring CUDA runtime API calls with eBPF uprobes ☆134 Updated 6 months ago
- Go Abstraction for Allocating NVIDIA GPUs with Custom Policies ☆116 Updated last month
- Prometheus exporter for an InfiniBand fabric ☆67 Updated last year
- An I/O benchmark for deep learning applications ☆91 Updated this week
- knavigator is a development, testing, and optimization toolkit for AI/ML scheduling systems at scale on Kubernetes. ☆70 Updated 3 months ago
- DPDK & SR-IOV CNI plugin ☆19 Updated last week
- Enabling Kubernetes to make pod placement decisions with platform intelligence. ☆176 Updated 8 months ago
- NCCL Profiling Kit ☆145 Updated last year
- cricket is a virtualization solution for GPUs ☆215 Updated last month
- Systematic and comprehensive benchmarks for LLM systems. ☆38 Updated 3 weeks ago