IBM / autopilot
A tool to detect infrastructure issues on cloud native AI systems
☆37 Updated last month
Alternatives and similar repositories for autopilot
Users who are interested in autopilot are comparing it to the libraries listed below.
- Cloud Native Benchmarking of Foundation Models ☆36 Updated last week
- An efficient GPU resource sharing system with fine-grained control for Linux platforms. ☆83 Updated last year
- NVIDIA NCCL Tests for Distributed Training ☆93 Updated last week
- GPU scheduler for elastic/distributed deep learning workloads in Kubernetes cluster (IC2E'23) ☆34 Updated last year
- Health checks for Azure N- and H-series VMs. ☆44 Updated last month
- Intelligent platform for AI workloads ☆37 Updated 2 years ago
- Systematic and comprehensive benchmarks for LLM systems. ☆15 Updated last week
- ☆43 Updated last year
- A lightweight vLLM simulator, for mocking out replicas. ☆24 Updated last week
- Lightweight daemon for monitoring CUDA runtime API calls with eBPF uprobes ☆98 Updated 2 months ago
- NCCL Profiling Kit ☆137 Updated 11 months ago
- ☆44 Updated 3 years ago
- RDMA CNI plugin for containerized workloads ☆53 Updated this week
- An I/O benchmark for deep learning applications ☆87 Updated 3 weeks ago
- rFaaS: a high-performance FaaS platform with RDMA acceleration for low-latency invocations. ☆51 Updated last week
- Magnum IO community repo ☆95 Updated last month
- ☆24 Updated last year
- An interference-aware scheduler for fine-grained GPU sharing ☆140 Updated 4 months ago
- Kubernetes RDMA SR-IOV device plugin ☆111 Updated 4 years ago
- knavigator is a development, testing, and optimization toolkit for AI/ML scheduling systems at scale on Kubernetes. ☆67 Updated last month
- A toolkit for discovering cluster network topology. ☆54 Updated last week
- A tool for coordinated checkpoint/restore of distributed applications with CRIU ☆23 Updated 2 weeks ago
- Holistic job manager on Kubernetes ☆116 Updated last year
- Artifacts for our NSDI'23 paper TGS ☆76 Updated last year
- Intent Driven Orchestration enables management of applications through their Service Level Objectives, while minimizing developer and adm… ☆38 Updated 2 months ago
- A TUI-based utility for real-time monitoring of InfiniBand traffic and performance metrics on the local node ☆23 Updated 3 weeks ago
- ☆38 Updated 5 months ago
- Automatic tuning for ML model deployment on Kubernetes ☆80 Updated 7 months ago
- Microsoft Collective Communication Library ☆64 Updated 6 months ago
- NCCL Fast Socket is a transport layer plugin to improve NCCL collective communication performance on Google Cloud. ☆116 Updated last year