A tool to detect infrastructure issues on cloud native AI systems
☆53 Sep 18, 2025 Updated 9 months ago
Alternatives and similar repositories for autopilot
Users that are interested in autopilot are comparing it to the libraries listed below.
- Queuing and quota management for AI/ML batch jobs on Kubernetes ☆17 Jul 16, 2025 Updated 11 months ago
- llm-d benchmark scripts and tooling ☆62 Updated this week
- Cloud Native Benchmarking of Foundation Models ☆45 Jul 31, 2025 Updated 11 months ago
- Failure dataset accompanying the paper "How Bad Can a Bug Get? An Empirical Analysis of Software Failures in the OpenStack Cloud Computi… ☆10 Jun 12, 2020 Updated 6 years ago
- Comprehensive Parallel I/O Tracing and Analysis ☆52 Apr 16, 2025 Updated last year
- ☆15 Jan 7, 2023 Updated 3 years ago
- Predict the performance of LLM inference services ☆23 Sep 18, 2025 Updated 9 months ago
- A hierarchical collective communications library with portable optimizations ☆38 Dec 8, 2024 Updated last year
- DXT Explorer is an interactive web-based log analysis tool for Darshan DXT logs. ☆18 Feb 19, 2026 Updated 4 months ago
- Solution Service Architecture ☆26 Jun 5, 2024 Updated 2 years ago
- GPUd automates monitoring, diagnostics, and issue identification for GPUs ☆485 Updated this week
- Red Hat Certified optional operator for secondary schedulers ☆21 Updated this week
- The MPI parallel MD-Workbench simulates user activities. ☆12 Jun 23, 2019 Updated 7 years ago
- This is the repository for an I/O benchmark that represents scientific deep learning workloads. ☆24 Dec 6, 2022 Updated 3 years ago
- Drishti provides I/O insights to help you improve your application's I/O performance. ☆25 Mar 3, 2026 Updated 4 months ago
- Auto-tuning for vllm. Getting the best performance out of your LLM deployment (vllm+guidellm+optuna) ☆59 Jun 12, 2026 Updated 3 weeks ago
- Augmented Dickey-Fuller implementation in Go ☆12 Mar 15, 2019 Updated 7 years ago
- Snapped is a parallel program snapshotter designed for debugging deadlocks and crashes in programs. It acts as a wrapper around the GDB M… ☆11 Aug 26, 2024 Updated last year
- ☆21 Apr 25, 2026 Updated 2 months ago
- Scripts for managing a large H100 cluster and fixing hardware issues to ensure smooth model training. ☆325 Aug 20, 2024 Updated last year
- Compiler for Fortran stencils using verified lifting ☆20 Apr 5, 2022 Updated 4 years ago
- ☆10 Dec 10, 2024 Updated last year
- Nabla Containers blog ☆12 May 26, 2021 Updated 5 years ago
- KAR: A Runtime for the Hybrid Cloud ☆31 Sep 17, 2025 Updated 9 months ago
- A clean monorepo template for a Python project using uv ☆13 Jul 8, 2025 Updated 11 months ago
- Code and other materials for the S2I2 Software Summer School ☆12 Mar 11, 2017 Updated 9 years ago
- example.on('end', mustCall(() => {})); Check the callback function is called. ☆11 Nov 20, 2022 Updated 3 years ago
- ☆10 Apr 7, 2020 Updated 6 years ago
- A curated list of autonomous research systems and tools. ☆120 Apr 3, 2026 Updated 3 months ago
- Distributed AI/HPC Monitoring Framework ☆29 Apr 11, 2025 Updated last year
- ☆36 Oct 31, 2025 Updated 8 months ago
- Simulation infrastructure and validation of Cori ☆13 Mar 22, 2022 Updated 4 years ago
- Official repository for the paper "KeyEE: Enhancing Low-resource Generative Event Extraction with Auxiliary Keyword Sub-Prompt" ☆10 Jun 5, 2024 Updated 2 years ago
- PyTorch implementation for the pilot study on the robustness of latent diffusion models. ☆12 Jun 20, 2023 Updated 3 years ago
- 12 Lessons, Get Started Building with Generative AI 🔗 https://microsoft.github.io/generative-ai-for-beginners/ ☆10 Nov 16, 2023 Updated 2 years ago
- A multi-platform experimentation framework written in Python. ☆69 Jun 25, 2026 Updated last week
- Simulator for HDD/SSD, derived from the CMU PDL DiskSim, with the SSD-add-on patch from Microsoft Research applied. ☆15 Dec 30, 2019 Updated 6 years ago
- CAShift: Benchmarking Log-Based Cloud Attack Detection under Normality Shift (FSE 2025) ☆15 Jun 25, 2026 Updated last week
- This project focuses on simulating a multi-tier storage system🔺, with an emphasis on data management📂🔄 through the implementation of v… ☆31 Mar 5, 2026 Updated 3 months ago