A tool to detect infrastructure issues on cloud native AI systems
☆53 · Sep 18, 2025 · Updated 8 months ago
Alternatives and similar repositories for autopilot
Users who are interested in autopilot are comparing it to the libraries listed below.
- Queuing and quota management for AI/ML batch jobs on Kubernetes ☆17 · Jul 16, 2025 · Updated 10 months ago
- AppWrapper controller for Kueue ☆17 · May 22, 2026 · Updated 2 weeks ago
- llm-d benchmark scripts and tooling ☆63 · Updated this week
- Cloud Native Benchmarking of Foundation Models ☆45 · Jul 31, 2025 · Updated 10 months ago
- Failure dataset accompanying the paper "How Bad Can a Bug Get? An Empirical Analysis of Software Failures in the OpenStack Cloud Computi… ☆10 · Jun 12, 2020 · Updated 5 years ago
- Comprehensive Parallel I/O Tracing and Analysis ☆52 · Apr 16, 2025 · Updated last year
- ☆15 · Jan 7, 2023 · Updated 3 years ago
- Holistic job manager on Kubernetes ☆117 · Feb 20, 2024 · Updated 2 years ago
- Predict the performance of LLM inference services ☆23 · Sep 18, 2025 · Updated 8 months ago
- DXT Explorer is an interactive web-based log analysis tool for Darshan DXT logs. ☆18 · Feb 19, 2026 · Updated 3 months ago
- Real-Time Intrusion Detection and Prevention with Neural Network in Kernel using eBPF ☆25 · Apr 9, 2024 · Updated 2 years ago
- hosted by HPC System Test Working Group collaboration ☆17 · Apr 30, 2026 · Updated last month
- Red Hat Certified optional operator for secondary schedulers ☆21 · Updated this week
- Project to manage Flux tasks needed to standardize kubernetes HPC scheduling interfaces ☆30 · Jan 9, 2026 · Updated 5 months ago
- Auto-tuning for vllm. Getting the best performance out of your LLM deployment (vllm+guidellm+optuna) ☆57 · Mar 17, 2026 · Updated 2 months ago
- Augmented Dickey-Fuller implementation in Go ☆12 · Mar 15, 2019 · Updated 7 years ago
- [DEPRECATED] Prometheus exporter for VPA recommendations ☆12 · Aug 22, 2023 · Updated 2 years ago
- Fast and efficient attention method exploration and implementation. ☆26 · Mar 25, 2025 · Updated last year
- Snapped is a parallel program snapshotter designed for debugging deadlocks and crashes in programs. It acts as a wrapper around the GDB M… ☆11 · Aug 26, 2024 · Updated last year
- Compiler for Fortran stencils using verified lifting ☆20 · Apr 5, 2022 · Updated 4 years ago
- Scripts for managing a large H100 cluster and fixing hardware issues to ensure smooth model training. ☆326 · Aug 20, 2024 · Updated last year
- ☆10 · Dec 10, 2024 · Updated last year
- Nabla Containers blog ☆12 · May 26, 2021 · Updated 5 years ago
- KAR: A Runtime for the Hybrid Cloud ☆31 · Sep 17, 2025 · Updated 8 months ago
- Code and other materials for the S2I2 Software Summer School ☆12 · Mar 11, 2017 · Updated 9 years ago
- Apache OpenWhisk Composer provides a high-level programming model in JavaScript for composing serverless functions ☆68 · Sep 24, 2024 · Updated last year
- example.on('end', mustCall(() => {})); Check the callback function is called. ☆11 · Nov 20, 2022 · Updated 3 years ago
- ☆17 · May 28, 2026 · Updated 2 weeks ago
- ☆10 · Apr 7, 2020 · Updated 6 years ago
- ☆12 · Aug 27, 2022 · Updated 3 years ago
- Distributed AI/HPC Monitoring Framework ☆29 · Apr 11, 2025 · Updated last year
- ☆34 · Oct 31, 2025 · Updated 7 months ago
- Simulation infrastructure and validation of Cori ☆13 · Mar 22, 2022 · Updated 4 years ago
- This repository will host the Developer Guide for the IBM Garage Cloud Native Toolkit ☆31 · Apr 24, 2023 · Updated 3 years ago
- PyTorch implementation for the pilot study on the robustness of latent diffusion models. ☆12 · Jun 20, 2023 · Updated 2 years ago
- 12 Lessons, Get Started Building with Generative AI 🔗 https://microsoft.github.io/generative-ai-for-beginners/ ☆10 · Nov 16, 2023 · Updated 2 years ago
- CAShift: Benchmarking Log-Based Cloud Attack Detection under Normality Shift (FSE 2025) ☆14 · May 19, 2025 · Updated last year
- Kubernetes operator for local LLM inference with llama.cpp, vLLM, TGI, and mlx-server; multi-GPU NVIDIA + Apple Silicon Metal, autoscali… ☆127 · Updated this week
- This project focuses on simulating a multi-tier storage system🔺, with an emphasis on data management📂🔄 through the implementation of v… ☆30 · Mar 5, 2026 · Updated 3 months ago