OrderLab/TrainCheck

An Observability Framework for AI Training

☆73

Alternatives and similar repositories for TrainCheck

Users who are interested in TrainCheck are comparing it to the repositories listed below.

OrderLab / awesome-machine-learning-reliability
A curated reading list for machine learning reliability research and practice
☆30 · Sep 18, 2025 · Updated 10 months ago
OrderLab / xinda
Automated Testing and Adaptive Detection of Slow Faults in Distributed Systems
☆19 · Mar 6, 2025 · Updated last year
microsoft / TrainVerify
A verification tool for ensuring parallelization equivalence in distributed model training.
☆17 · Sep 1, 2025 · Updated 10 months ago
OrderLab / ePass
A compiler framework for eBPF programs
☆21 · Jul 11, 2026 · Updated last week
OrderLab / Legolas
Legolas: A Fault Injection Framework for Efficient Exposure of Partial Failures in Distributed Systems
☆11 · Mar 29, 2024 · Updated 2 years ago
OrderLab / orbit
Orbit: OS Support for Safe and Efficient Auxiliary Tasks in Applications
☆22 · May 23, 2022 · Updated 4 years ago
open-neutrino / neutrino
☆263 · Dec 25, 2025 · Updated 6 months ago
824728350 / Zodiac
Zodiac: Unearthing Semantic Checks for Cloud Infrastructure-as-Code Programs, SOSP 2024
☆15 · Nov 28, 2024 · Updated last year
bastoica / wasabi
Wasabi is a toolkit designed to isolate and trigger retry bugs by combining static program analysis, large language models (LLMs), fault …
☆10 · Oct 8, 2024 · Updated last year
SunHao-0 / BCF
eBPF Certificate Framework
☆20 · Jan 3, 2026 · Updated 6 months ago
self-checker / SelfChecker
ICSE2021 Submission
☆13 · Aug 28, 2022 · Updated 3 years ago
Terra-Flux / PolyRL
[NSDI'26] PolyRL is a reinforcement learning framework for LLMs that harvests spot instances on the cloud to reduce cost.
☆19 · Mar 30, 2026 · Updated 3 months ago
shijy16 / ACETest
For our ISSTA'23 paper ACETest: Automated Constraint Extraction for Testing Deep Learning Operators
☆17 · Apr 28, 2026 · Updated 2 months ago
DiT-Serving / TetriServe
[ASPLOS'26] TetriServe: Efficiently Serving Mixed DiT Workloads
☆17 · Mar 12, 2026 · Updated 4 months ago
xlab-uiuc / AIOpsLab
A holistic framework to enable the design, development, and evaluation of autonomous AIOps agents.
☆11 · May 21, 2025 · Updated last year
NEO-MLSys25 / NEO
NEO is an LLM inference engine built to ease the GPU memory crisis through CPU offloading
☆99 · Jun 16, 2025 · Updated last year
LiftLab-UVA / PilotExecution
[NSDI26] Pilot Execution: Simulating Failure Recovery In Situ for Production Distributed Systems
☆19 · Mar 11, 2026 · Updated 4 months ago
SymbioticLab / Oobleck
A resilient distributed training framework
☆100 · Apr 11, 2024 · Updated 2 years ago
sysartifacts / sysartifacts.github.io
Website for Artifact Evaluation at EuroSys, SOSP, OSDI, ATC
☆53 · Jul 12, 2026 · Updated last week
eunomia-bpf / nccl-eBPF
☆20 · Jul 7, 2026 · Updated 2 weeks ago
llylly / RANUM
[ICSE 2023] Differentiable interpretation and failure-inducing input generation for neural network numerical bugs.
☆13 · Jan 5, 2024 · Updated 2 years ago
ChijinZ / PolyJuice-Fuzzer
A DL compiler fuzzer
☆15 · Nov 1, 2024 · Updated last year
GLaDOS-Michigan / verification-class
Material for the class on verification of distributed and asynchronous systems, developed by Jon Howell and Manos Kapritsos
☆12 · Feb 7, 2025 · Updated last year
ise-uiuc / KNighter
[SOSP'25] Automatic checker synthesis for system-level static analysis
☆181 · Oct 26, 2025 · Updated 8 months ago
itbench-hub / ITBench-Scenarios
⚠️ ARCHIVED - All development moved to https://github.com/itbench-hub/ITBench/tree/main/scenarios
☆16 · Feb 24, 2026 · Updated 4 months ago
xlab-uiuc / reading-system-verification-papers
A reading group for system verification papers
☆10 · Sep 28, 2023 · Updated 2 years ago
MincYu / pheromone
☆45 · Nov 15, 2022 · Updated 3 years ago
xlab-uiuc / stratus
☆22 · Oct 23, 2025 · Updated 8 months ago
THU-feiyue / database
Tsinghua University Feiyue database
☆34 · Updated this week
ise-uiuc / WhiteFox
WhiteFox: White-Box Compiler Fuzzing Empowered by Large Language Models (OOPSLA 2024)
☆84 · Aug 5, 2025 · Updated 11 months ago
InternLM / AcmeTrace
☆179 · Mar 12, 2024 · Updated 2 years ago
zhang677 / PCL-lite
[ICML 2025] Adaptive Self-improvement LLM Agentic System for ML Library Development
☆17 · Jan 6, 2026 · Updated 6 months ago
WM-SEMERU / ACER
ACER is an AST-based Callgraph Generator Development Framework
☆41 · Jun 17, 2024 · Updated 2 years ago
shady1543 / eACGM
[IWQoS 2025] eACGM: An eBPF-based Automated Comprehensive Governance and Monitoring framework for AI/ML systems.
☆23 · Aug 11, 2025 · Updated 11 months ago
NVlabs / nvbitfi
Architecture-level Fault Injection Tool for GPU Application Resilience Evaluation
☆84 · Oct 17, 2023 · Updated 2 years ago
Pacific73 / Heracles
A simple implementation of the Google Heracles system (ISCA'15).
☆10 · Jun 8, 2020 · Updated 6 years ago
NVIDIA / Fabric-Manager-Client
This is a tool for managing GPU partitions for NVIDIA Fabric Manager’s Shared NVSwitch.
☆17 · Jul 2, 2026 · Updated 2 weeks ago
liushulinle / MarsRL
MarsRL: Advancing Multi-Agent Reasoning System via Reinforcement Learning with Agentic Pipeline Parallelism
☆18 · Nov 18, 2025 · Updated 8 months ago
IBM / LLM-performance-prediction
Predict the performance of LLM inference services
☆23 · Sep 18, 2025 · Updated 10 months ago