IBM / autopilot
A tool to detect infrastructure issues on cloud native AI systems
☆35 Updated last month
Alternatives and similar repositories for autopilot:
Users that are interested in autopilot are comparing it to the libraries listed below
- Cloud Native Benchmarking of Foundation Models ☆32 Updated 6 months ago
- NVIDIA NCCL Tests for Distributed Training ☆89 Updated this week
- GPU scheduler for elastic/distributed deep learning workloads in Kubernetes cluster (IC2E'23) ☆34 Updated last year
- rFaaS: a high-performance FaaS platform with RDMA acceleration for low-latency invocations. ☆51 Updated 3 weeks ago
- NCCL Profiling Kit ☆133 Updated 10 months ago
- ☆44 Updated 3 years ago
- Tiresias is a GPU cluster manager for distributed deep learning training. ☆153 Updated 5 years ago
- Artifacts for our NSDI'23 paper TGS ☆75 Updated 10 months ago
- An efficient GPU resource sharing system with fine-grained control for Linux platforms. ☆82 Updated last year
- Code for "Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads", which appeared at OSDI 2020 ☆128 Updated 9 months ago
- An I/O benchmark for deep learning applications ☆87 Updated last week
- Fine-grained GPU sharing primitives ☆141 Updated 5 years ago
- Wukong: A scalable and locality-enhanced serverless parallel framework (ACM SoCC'20) ☆74 Updated 6 months ago
- An interference-aware scheduler for fine-grained GPU sharing ☆133 Updated 3 months ago
- Magnum IO community repo ☆90 Updated 3 months ago
- FaaSNet: Scalable and Fast Provisioning of Custom Serverless Container Runtimes at Alibaba Cloud Function Compute (USENIX ATC'21) ☆54 Updated 3 years ago
- Automatic tuning for ML model deployment on Kubernetes ☆80 Updated 6 months ago
- Intelligent platform for AI workloads ☆37 Updated 2 years ago
- Intent Driven Orchestration enables management of applications through their Service Level Objectives, while minimizing developer and adm… ☆37 Updated 3 weeks ago
- ☆186 Updated 5 years ago
- SOTA Learning-augmented Systems ☆36 Updated 2 years ago
- 🔮 Execution time predictions for deep neural network training iterations across different GPUs. ☆62 Updated 2 years ago
- GPU-scheduler-for-deep-learning ☆205 Updated 4 years ago
- The source code of INFless, a native serverless platform for AI inference. ☆38 Updated 2 years ago
- The criu-coordinator tool aims to enable checkpoint/restore support for distributed applications with CRIU. ☆21 Updated 2 months ago
- NCCL Fast Socket is a transport layer plugin to improve NCCL collective communication performance on Google Cloud. ☆116 Updated last year
- Microsoft Collective Communication Library ☆65 Updated 5 months ago
- Paella: Low-latency Model Serving with Virtualized GPU Scheduling ☆58 Updated last year
- ☆24 Updated last year
- This repository contains experimental tools we developed to forecast a cluster's resource (CPU or memory) usage. ☆39 Updated 4 years ago