A tool to detect infrastructure issues on cloud native AI systems
☆52 · Sep 18, 2025 · Updated 5 months ago
Alternatives and similar repositories for autopilot
Users who are interested in autopilot are comparing it to the libraries listed below
- Queuing and quota management for AI/ML batch jobs on Kubernetes ☆14 · Jul 16, 2025 · Updated 7 months ago
- Cloud Native Benchmarking of Foundation Models ☆45 · Jul 31, 2025 · Updated 7 months ago
- llm-d benchmark scripts and tooling ☆48 · Updated this week
- The MPI parallel MD-Workbench simulates user activities. ☆12 · Jun 23, 2019 · Updated 6 years ago
- ☆15 · Jan 7, 2023 · Updated 3 years ago
- AppWrapper controller for Kueue ☆17 · Feb 11, 2026 · Updated 2 weeks ago
- hosted by HPC System Test Working Group collaboration ☆17 · Feb 17, 2026 · Updated last week
- DXT Explorer is an interactive web-based log analysis tool for Darshan DXT logs. ☆17 · Feb 19, 2026 · Updated last week
- A hierarchical collective communications library with portable optimizations ☆37 · Dec 8, 2024 · Updated last year
- Demo for DevNation Tech Talk 2020 (Debezium/Kafka Streams/Quarkus/Knative) ☆16 · Nov 18, 2024 · Updated last year
- Predict the performance of LLM inference services ☆21 · Sep 18, 2025 · Updated 5 months ago
- Holistic job manager on Kubernetes ☆116 · Feb 20, 2024 · Updated 2 years ago
- GPUd automates monitoring, diagnostics, and issue identification for GPUs ☆476 · Updated this week
- Drishti provides I/O insights to help you improve your application's I/O performance. ☆23 · Feb 18, 2026 · Updated last week
- Femto-Containers RIOT Implementation & Hands-on Tutorials ☆25 · Feb 16, 2023 · Updated 3 years ago
- Project to manage Flux tasks needed to standardize Kubernetes HPC scheduling interfaces ☆27 · Jan 9, 2026 · Updated last month
- ☆32 · Oct 31, 2025 · Updated 4 months ago
- This is the repository for an I/O benchmark which represents Scientific Deep Learning Workloads. ☆23 · Dec 6, 2022 · Updated 3 years ago
- Compute processor utilization and system call processing metrics based on "perf" trace data ☆24 · May 17, 2021 · Updated 4 years ago
- ☆323 · Aug 20, 2024 · Updated last year
- Solution Service Architecture ☆25 · Jun 5, 2024 · Updated last year
- Material for the SC21 Deep Learning at Scale Tutorial ☆27 · Feb 13, 2023 · Updated 3 years ago
- A suite of parallel file system tools designed for performance and scalability ☆29 · May 14, 2024 · Updated last year
- ☆40 · Feb 19, 2026 · Updated last week
- ☆11 · Aug 27, 2022 · Updated 3 years ago
- Create and deploy virtual-experiments - co-processing computational workflows ☆10 · Jan 28, 2026 · Updated last month
- Storage Scale Installation and Configuration ☆79 · Feb 20, 2026 · Updated last week
- A command line utility to manage the configuration of a system's high performance network interfaces for RoCE deployments ☆35 · Jul 25, 2023 · Updated 2 years ago
- NVIDIA Resiliency Extension is a python package for framework developers and users to implement fault-tolerant features. It improves the … ☆264 · Updated this week
- Systematic and comprehensive benchmarks for LLM systems.