IBM / autopilot
A tool to detect infrastructure issues on cloud native AI systems
☆52Updated 4 months ago
Alternatives and similar repositories for autopilot
Users that are interested in autopilot are comparing it to the libraries listed below
- Cloud Native Benchmarking of Foundation Models☆45Updated 6 months ago
- NVIDIA NCCL Tests for Distributed Training☆134Updated last week
- Systematic and comprehensive benchmarks for LLM systems.☆50Updated last week
- An efficient GPU resource sharing system with fine-grained control for Linux platforms.☆88Updated last year
- Health checks for Azure N- and H-series VMs.☆57Updated this week
- llm-d benchmark scripts and tooling☆44Updated this week
- A toolkit for discovering cluster network topology.☆96Updated last week
- CUDA checkpoint and restore utility☆410Updated 4 months ago
- MIG Partition Editor for NVIDIA GPUs☆240Updated this week
- Automatic tuning for ML model deployment on Kubernetes☆81Updated last year
- ☆43Updated last year
- Holistic job manager on Kubernetes☆116Updated last year
- Share GPU between Pods in Kubernetes☆216Updated 3 years ago
- Intent Driven Orchestration enables management of applications through their Service Level Objectives, while minimizing developer and adm…☆48Updated 2 months ago
- Lightweight daemon for monitoring CUDA runtime API calls with eBPF uprobes☆146Updated 10 months ago
- This repository contains experimental tools we developed to forecast a cluster's resource (CPU or memory) usage.☆44Updated 4 years ago
- Offline optimization of your disaggregated Dynamo graph☆177Updated last week
- GPU scheduler for elastic/distributed deep learning workloads in Kubernetes cluster (IC2E'23)☆34Updated 2 years ago
- DOCA Platform manages provisioning and service orchestration for Bluefield DPUs☆76Updated this week
- NVIDIA Network Operator☆320Updated this week
- A workload for deploying LLM inference services on Kubernetes☆168Updated last week
- Distributed KV cache scheduling & offloading libraries☆101Updated last week
- Open Model Engine (OME) — Kubernetes operator for LLM serving, GPU scheduling, and model lifecycle management. Works with SGLang, vLLM, T…☆365Updated this week
- RDMA CNI plugin for containerized workloads☆58Updated 3 weeks ago
- Kubernetes Rdma SRIOV device plugin☆114Updated 5 years ago
- An I/O benchmark for deep Learning applications☆102Updated last month
- Enabling Kubernetes to make pod placement decisions with platform intelligence.☆176Updated last year
- Kubernetes enhancements for Network Topology Aware Gang Scheduling & Autoscaling☆159Updated this week
- NVIDIA Networking NIC Configuration Operator For Kubernetes☆14Updated this week
- knavigator is a development, testing, and optimization toolkit for AI/ML scheduling systems at scale on Kubernetes.☆74Updated 6 months ago