IBM / autopilot
A tool to detect infrastructure issues on cloud native AI systems
☆28 · Updated this week
Alternatives and similar repositories for autopilot:
Users who are interested in autopilot are comparing it to the libraries listed below:
- Cloud Native Benchmarking of Foundation Models ☆24 · Updated 4 months ago
- Intelligent platform for AI workloads ☆37 · Updated 2 years ago
- An efficient GPU resource sharing system with fine-grained control for Linux platforms. ☆79 · Updated 11 months ago
- GPU scheduler for elastic/distributed deep learning workloads in Kubernetes cluster (IC2E'23) ☆34 · Updated last year
- ☆42 · Updated 10 months ago
- Automatic tuning for ML model deployment on Kubernetes ☆81 · Updated 4 months ago
- NVIDIA NCCL Tests for Distributed Training ☆85 · Updated last week
- Fine-grained GPU sharing primitives ☆141 · Updated 5 years ago
- ☆23 · Updated last year
- Forked from ☆10 · Updated 4 years ago
- The criu-coordinator tool aims to enable checkpoint/restore support for distributed applications with CRIU. ☆20 · Updated 2 weeks ago
- FaaSNet: Scalable and Fast Provisioning of Custom Serverless Container Runtimes at Alibaba Cloud Function Compute (USENIX ATC'21) ☆54 · Updated 3 years ago
- rFaaS: a high-performance FaaS platform with RDMA acceleration for low-latency invocations. ☆50 · Updated this week
- ☆14 · Updated 3 years ago
- SOTA Learning-augmented Systems ☆35 · Updated 2 years ago
- ☆43 · Updated 3 years ago
- Tiresias is a GPU cluster manager for distributed deep learning training. ☆152 · Updated 4 years ago
- Intent Driven Orchestration enables management of applications through their Service Level Objectives, while minimizing developer and adm… ☆36 · Updated last week
- High performance RDMA-based distributed feature collection component for training GNN model on EXTREMELY large graph ☆51 · Updated 2 years ago
- An Efficient Dynamic Resource Scheduler for Deep Learning Clusters ☆42 · Updated 7 years ago
- ☆237 · Updated this week
- Kubernetes Rdma SRIOV device plugin ☆110 · Updated 4 years ago
- GPU-scheduler-for-deep-learning ☆203 · Updated 4 years ago
- MeshInsight: Dissecting Overheads of Service Mesh Sidecars ☆46 · Updated last year
- Wukong: A scalable and locality-enhanced serverless parallel framework (ACM SoCC'20) ☆73 · Updated 4 months ago
- A resilient distributed training framework ☆90 · Updated 11 months ago
- 🔮 Execution time predictions for deep neural network training iterations across different GPUs. ☆60 · Updated 2 years ago
- Holistic job manager on Kubernetes ☆112 · Updated last year
- RDMA CNI plugin for containerized workloads ☆51 · Updated last week
- Lightweight daemon for monitoring CUDA runtime API calls with eBPF uprobes ☆76 · Updated last month