IBM / autopilot
A tool to detect infrastructure issues on cloud native AI systems
☆36 Updated last week
Alternatives and similar repositories for autopilot
Users that are interested in autopilot are comparing it to the libraries listed below
- Cloud Native Benchmarking of Foundation Models ☆34 Updated 2 weeks ago
- NVIDIA NCCL Tests for Distributed Training ☆91 Updated last week
- An efficient GPU resource sharing system with fine-grained control for Linux platforms. ☆83 Updated last year
- GPU scheduler for elastic/distributed deep learning workloads in Kubernetes cluster (IC2E'23) ☆34 Updated last year
- Intelligent platform for AI workloads ☆37 Updated 2 years ago
- ☆44 Updated 3 years ago
- A light weight vLLM simulator, for mocking out replicas. ☆18 Updated this week
- Magnum IO community repo ☆95 Updated 2 weeks ago
- ☆11 Updated last week
- Automatic tuning for ML model deployment on Kubernetes ☆80 Updated 6 months ago
- ☆23 Updated last year
- PArametrized Recommendation and Ai Model benchmark is a repository for development of numerous uBenchmarks as well as end to end nets for… ☆140 Updated this week
- Holistic job manager on Kubernetes ☆115 Updated last year
- Microsoft Collective Communication Library ☆65 Updated 6 months ago
- NCCL Fast Socket is a transport layer plugin to improve NCCL collective communication performance on Google Cloud. ☆116 Updated last year
- Lightweight daemon for monitoring CUDA runtime API calls with eBPF uprobes ☆96 Updated 2 months ago
- ☆42 Updated last year
- Artifacts for our NSDI'23 paper TGS ☆75 Updated 11 months ago
- NCCL Profiling Kit ☆134 Updated 10 months ago
- ☆62 Updated 11 months ago
- ☆35 Updated 4 months ago
- Predict the performance of LLM inference services ☆18 Updated 3 weeks ago
- Health checks for Azure N- and H-series VMs. ☆41 Updated last month
- An interference-aware scheduler for fine-grained GPU sharing ☆137 Updated 4 months ago
- The criu-coordinator tool aims to enable checkpoint/restore support for distributed applications with CRIU. ☆21 Updated 2 months ago
- High performance RDMA-based distributed feature collection component for training GNN model on EXTREMELY large graph ☆54 Updated 2 years ago
- SOTA Learning-augmented Systems ☆36 Updated 3 years ago
- 🔮 Execution time predictions for deep neural network training iterations across different GPUs. ☆62 Updated 2 years ago
- SpotServe: Serving Generative Large Language Models on Preemptible Instances ☆121 Updated last year
- A resilient distributed training framework ☆95 Updated last year