A tool to detect infrastructure issues on cloud native AI systems
☆52 Sep 18, 2025 Updated 6 months ago
Alternatives and similar repositories for autopilot
Users that are interested in autopilot are comparing it to the libraries listed below
- Queuing and quota management for AI/ML batch jobs on Kubernetes ☆15 Jul 16, 2025 Updated 8 months ago
- AppWrapper controller for Kueue ☆17 Updated this week
- llm-d benchmark scripts and tooling ☆49 Updated this week
- Cloud Native Benchmarking of Foundation Models ☆45 Jul 31, 2025 Updated 7 months ago
- Comprehensive Parallel I/O Tracing and Analysis ☆52 Apr 16, 2025 Updated 11 months ago
- ☆15 Jan 7, 2023 Updated 3 years ago
- Holistic job manager on Kubernetes ☆116 Feb 20, 2024 Updated 2 years ago
- Predict the performance of LLM inference services ☆23 Sep 18, 2025 Updated 6 months ago
- ☆39 Updated this week
- A hierarchical collective communications library with portable optimizations ☆37 Dec 8, 2024 Updated last year
- DXT Explorer is an interactive web-based log analysis tool for Darshan DXT logs. ☆17 Feb 19, 2026 Updated last month
- Real-Time Intrusion Detection and Prevention with Neural Network in Kernel using eBPF ☆23 Apr 9, 2024 Updated last year
- hosted by HPC System Test Working Group collaboration ☆17 Feb 17, 2026 Updated last month
- Red Hat Certified optional operator for secondary schedulers ☆22 Updated this week
- The MPI parallel MD-Workbench simulates user activities. ☆12 Jun 23, 2019 Updated 6 years ago
- This is a repository for an I/O benchmark that represents Scientific Deep Learning Workloads. ☆23 Dec 6, 2022 Updated 3 years ago
- ☆14 Mar 12, 2026 Updated last week
- Auto-tuning for vllm. Getting the best performance out of your LLM deployment (vllm+guidellm+optuna) ☆50 Mar 12, 2026 Updated last week
- Augmented Dickey-Fuller implementation in Go ☆12 Mar 15, 2019 Updated 7 years ago
- Fast and efficient attention method exploration and implementation. ☆25 Mar 25, 2025 Updated 11 months ago
- [DEPRECATED] Prometheus exporter for VPA recommendations ☆12 Aug 22, 2023 Updated 2 years ago
- ☆21 Feb 14, 2026 Updated last month
- Utilities for ROCm Tech Support Log Collections ☆13 Mar 14, 2026 Updated last week
- Snapped is a parallel program snapshotter designed for debugging deadlocks and crashes in programs. It acts as a wrapper around the GDB M… ☆11 Aug 26, 2024 Updated last year
- Scripts for managing a large H100 cluster and fixing hardware issues to ensure smooth model training. ☆323 Aug 20, 2024 Updated last year
- A suite of parallel file system tools designed for performance and scalability ☆29 May 14, 2024 Updated last year
- ☆10 Dec 10, 2024 Updated last year
- Gridsim simulator ☆12 May 12, 2017 Updated 8 years ago
- Code and other materials for the S2I2 Software Summer School ☆12 Mar 11, 2017 Updated 9 years ago
- KAR: A Runtime for the Hybrid Cloud ☆30 Sep 17, 2025 Updated 6 months ago
- A clean monorepo template for a Python project using uv ☆13 Jul 8, 2025 Updated 8 months ago
- Apache OpenWhisk Composer provides a high-level programming model in JavaScript for composing serverless functions ☆69 Sep 24, 2024 Updated last year
- example.on('end', mustCall(() => {})); Check that the callback function is called. ☆10 Nov 20, 2022 Updated 3 years ago
- ☆17 Nov 3, 2025 Updated 4 months ago
- ☆10 Apr 7, 2020 Updated 5 years ago
- ☆11 Aug 27, 2022 Updated 3 years ago
- Distributed AI/HPC Monitoring Framework ☆29 Apr 11, 2025 Updated 11 months ago
- ☆32 Oct 31, 2025 Updated 4 months ago
- Simulation infrastructure and validation of Cori ☆13 Mar 22, 2022 Updated 4 years ago