Harbor is a framework for running agent evaluations and creating and using RL environments.
☆836 · Mar 3, 2026 · Updated this week
Alternatives and similar repositories for harbor
Users who are interested in harbor compare it to the libraries listed below.
- A benchmark for LLMs on complicated tasks in the terminal ☆1,651 · Jan 22, 2026 · Updated last month
- A framework for pitting LLMs against each other in an evolving library of games ⚔ ☆35 · Apr 17, 2025 · Updated 10 months ago
- [COLING25] CodeJudge Eval: Can Large Language Models be Good Judges in Code Understanding? ☆12 · Dec 3, 2024 · Updated last year
- Our library for RL environments + evals ☆3,869 · Feb 28, 2026 · Updated last week
- Training Models Daily ☆16 · Dec 19, 2023 · Updated 2 years ago
- Fluid Language Model Benchmarking ☆26 · Sep 16, 2025 · Updated 5 months ago
- Entropy Based Sampling and Parallel CoT Decoding ☆17 · Oct 9, 2024 · Updated last year
- SkyRL: A Modular Full-stack RL Library for LLMs ☆1,656 · Updated this week
- Open sourced backend for Martian's LLM Inference Provider Leaderboard ☆21 · Aug 13, 2024 · Updated last year
- Training GPTs to solve interaction nets ☆18 · Aug 14, 2024 · Updated last year
- Benchmarking Goal-Oriented Software Engineering ☆115 · Jan 7, 2026 · Updated last month
- Data recipes and robust infrastructure for training AI agents ☆104 · Feb 28, 2026 · Updated last week
- Evaluation utilities based on SymPy. ☆21 · Dec 12, 2024 · Updated last year
- Aidan Bench attempts to measure <big_model_smell> in LLMs. ☆318 · Jun 26, 2025 · Updated 8 months ago
- [ACL 2025 Main] Official Repository for "Evaluating Language Models as Synthetic Data Generators" ☆41 · Dec 13, 2024 · Updated last year
- SWE-bench: Can Language Models Resolve Real-world Github Issues? ☆4,385 · Feb 19, 2026 · Updated 2 weeks ago
- Approximating the joint distribution of language models via MCTS ☆22 · Nov 3, 2024 · Updated last year
- Convert GitHub PRs into Harbor tasks ☆47 · Feb 27, 2026 · Updated last week
- Terminal-Bench-Science: Evaluating AI Agents on Complex Real-World Scientific Workflows in the Terminal ☆25 · Updated this week
- AAIF landscape ☆33 · Jan 15, 2026 · Updated last month
- Run evals using LLM ☆27 · Jan 8, 2026 · Updated last month
- Agentless🐱: an agentless approach to automatically solve software development problems ☆2,011 · Dec 22, 2024 · Updated last year
- Super basic implementation (gist-like) of RLMs with REPL environments. ☆700 · Jan 7, 2026 · Updated 2 months ago
- Secure Nix sandbox for LLM agents ☆27 · Updated this week
- ☆13 · Apr 7, 2024 · Updated last year
- Code for Paper: Training Software Engineering Agents and Verifiers with SWE-Gym [ICML 2025] ☆644 · Jul 29, 2025 · Updated 7 months ago
- [ICLR 2026] The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution ☆235 · Updated this week
- ☆4,368 · Jul 31, 2025 · Updated 7 months ago
- DeMo: Decoupled Momentum Optimization ☆198 · Dec 2, 2024 · Updated last year
- Async RL Training at Scale ☆1,107 · Updated this week
- [NeurIPS 2025 D&B Spotlight] Scaling Data for SWE-agents ☆584 · Updated this week
- A simple tutorial script on Streamlit using the Iris Dataset ☆12 · Sep 13, 2023 · Updated 2 years ago
- Vast.ai python sdk ☆21 · Feb 28, 2026 · Updated last week
- Python client for Google Kaniko ☆11 · Jul 19, 2022 · Updated 3 years ago
- Probing task; contextual embeddings -> textual definitions (EMNLP19) ☆11 · Apr 22, 2021 · Updated 4 years ago
- ☆12 · May 30, 2025 · Updated 9 months ago
- Convert a regular GPT call into a ChatGPT call ☆14 · Mar 2, 2023 · Updated 3 years ago
- ☆12 · Feb 11, 2026 · Updated 3 weeks ago
- A command-line interface tool for creating, managing, and verifying Content Provenance and Authenticity (C2PA) manifests for machine lear… ☆21 · Feb 25, 2026 · Updated last week