IBM / autopilot
A tool to detect infrastructure issues on cloud native AI systems
☆30 Updated 3 weeks ago
Alternatives and similar repositories for autopilot:
Users who are interested in autopilot are comparing it to the libraries listed below:
- Cloud Native Benchmarking of Foundation Models ☆30 Updated 5 months ago
- NVIDIA NCCL Tests for Distributed Training ☆88 Updated this week
- GPU scheduler for elastic/distributed deep learning workloads in Kubernetes cluster (IC2E'23) ☆34 Updated last year
- An efficient GPU resource sharing system with fine-grained control for Linux platforms. ☆82 Updated last year
- Automatic tuning for ML model deployment on Kubernetes ☆81 Updated 5 months ago
- Intelligent platform for AI workloads ☆37 Updated 2 years ago
- ☆44 Updated 3 years ago
- Fine-grained GPU sharing primitives ☆141 Updated 5 years ago
- ☆42 Updated 11 months ago
- An I/O benchmark for deep learning applications ☆82 Updated last week
- ☆24 Updated last year
- Tiresias is a GPU cluster manager for distributed deep learning training. ☆152 Updated 4 years ago
- Code for "Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads", which appeared at OSDI 2020 ☆127 Updated 8 months ago
- NCCL Profiling Kit ☆129 Updated 9 months ago
- ☆36 Updated 4 months ago
- Artifacts for our NSDI'23 paper TGS ☆75 Updated 10 months ago
- NCCL Fast Socket is a transport layer plugin to improve NCCL collective communication performance on Google Cloud. ☆116 Updated last year
- Magnum IO community repo ☆89 Updated 2 months ago
- Holistic job manager on Kubernetes ☆115 Updated last year
- An Efficient Dynamic Resource Scheduler for Deep Learning Clusters ☆42 Updated 7 years ago
- An interference-aware scheduler for fine-grained GPU sharing ☆130 Updated 2 months ago
- Microsoft Collective Communication Library ☆65 Updated 4 months ago
- rFaaS: a high-performance FaaS platform with RDMA acceleration for low-latency invocations. ☆51 Updated 3 weeks ago
- ☆41 Updated 9 months ago
- Intent Driven Orchestration enables management of applications through their Service Level Objectives, while minimizing developer and adm… ☆36 Updated this week
- FaaSNet: Scalable and Fast Provisioning of Custom Serverless Container Runtimes at Alibaba Cloud Function Compute (USENIX ATC'21) ☆54 Updated 3 years ago
- The criu-coordinator tool aims to enable checkpoint/restore support for distributed applications with CRIU. ☆20 Updated last month
- Forked form ☆11 Updated 4 years ago
- PArametrized Recommendation and Ai Model benchmark is a repository for development of numerous uBenchmarks as well as end to end nets for… ☆134 Updated last week
- RDMA CNI plugin for containerized workloads ☆52 Updated last week