Azure / azurehpc-health-checks
Health checks for Azure N- and H-series VMs.
☆46 Updated last week
Alternatives and similar repositories for azurehpc-health-checks
Users interested in azurehpc-health-checks compare it to the repositories listed below.
- NVIDIA NCCL Tests for Distributed Training ☆97 Updated 2 weeks ago
- MIG Partition Editor for NVIDIA GPUs ☆202 Updated this week
- ☆276 Updated last week
- NVIDIA Network Operator ☆262 Updated this week
- A tool to detect infrastructure issues on cloud native AI systems ☆41 Updated last month
- Kubernetes Rdma SRIOV device plugin ☆111 Updated 4 years ago
- Cloud Native Benchmarking of Foundation Models ☆38 Updated last month
- ☆43 Updated last year
- ☆62 Updated last week
- An efficient GPU resource sharing system with fine-grained control for Linux platforms. ☆82 Updated last year
- DPDK & SR-IOV CNI plugin ☆19 Updated last week
- Hooked CUDA-related dynamic libraries by using automated code generation tools. ☆158 Updated last year
- RDMA CNI plugin for containerized workloads ☆55 Updated 2 weeks ago
- Go Abstraction for Allocating NVIDIA GPUs with Custom Policies ☆114 Updated last week
- Kubernetes Container Runtime Interface proxy service with hardware resource aware workload placement policies ☆179 Updated 2 months ago
- knavigator is a development, testing, and optimization toolkit for AI/ML scheduling systems at scale on Kubernetes. ☆67 Updated 2 months ago
- RDMA device plugin for Kubernetes ☆217 Updated last year
- This is a plugin which lets EC2 developers use libfabric as network provider while running NCCL applications. ☆177 Updated this week
- GenAI inference performance benchmarking tool ☆64 Updated this week
- HAMi-core compiles libvgpu.so, which ensures hard limit on GPU in container ☆182 Updated last week
- Kubernetes Operator for AI and Bigdata Elastic Training ☆87 Updated 6 months ago
- cricket is a virtualization solution for GPUs ☆205 Updated last month
- ☆25 Updated 3 weeks ago
- ☆66 Updated 6 months ago
- RDMA and SHARP plugins for nccl library ☆197 Updated 3 weeks ago
- A light weight vLLM simulator, for mocking out replicas. ☆30 Updated this week
- Holistic job manager on Kubernetes ☆116 Updated last year
- NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs ☆542 Updated 2 months ago
- Lightweight daemon for monitoring CUDA runtime API calls with eBPF uprobes ☆113 Updated 3 months ago
- CUDA checkpoint and restore utility ☆345 Updated 5 months ago