harbor-framework/harbor

harbor-framework / harbor

Framework for evaluating and improving agents

☆3,504

Alternatives and similar repositories for harbor

Users who are interested in harbor are comparing it to the repositories listed below.

harbor-framework / terminal-bench
A benchmark for LLMs on complicated tasks in the terminal
☆2,483 · Jul 11, 2026 · Updated 2 weeks ago
harbor-framework / harbor-cookbook
Realistic examples of building evals and optimizing agents with Harbor
☆145 · Apr 23, 2026 · Updated 3 months ago
harbor-framework / frontier-bench
Measuring and evolving with the frontier of agent work
☆387 · Updated this week
harbor-framework / terminal-bench-2
☆344 · Apr 30, 2026 · Updated 2 months ago
NovaSky-AI / SkyRL
SkyRL: A Modular Full-stack RL Library for LLMs
☆2,093 · Updated this week
THUDM / slime
slime is an LLM post-training framework for RL Scaling.
☆7,629 · Updated this week
benchflow-ai / skillsbench
SkillsBench evaluates how well skills work and how effective agents are at using them.
☆1,577 · Updated this week
verl-project / verl
verl/HybridFlow: A Flexible and Efficient RL Post-Training Framework
☆22,654 · Updated this week
hkust-nlp / Toolathlon
[ICLR 2026] The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution
☆440 · Updated this week
R2E-Gym / R2E-Gym
[COLM 2025] Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents
☆310 · Jul 13, 2025 · Updated last year
rllm-org / rllm
Democratizing Reinforcement Learning for LLMs
☆5,731 · Updated this week
PrimeIntellect-ai / verifiers
Our library for RL environments + evals
☆4,400 · Updated this week
radixark / miles
Miles is an enterprise-facing reinforcement learning framework for LLM and VLM post-training, forked from and co-evolving with slime.
☆1,789 · Updated this week
SWE-bench / SWE-smith
[NeurIPS 2025 D&B Spotlight] Scaling Data for SWE-agents
☆711 · Updated this week
PrimeIntellect-ai / prime-rl
Agentic RL Training at Scale
☆1,724 · Updated this week
SWE-agent / mini-swe-agent
The 100 line AI agent that solves GitHub issues or helps you in your command line. Radically simple, no huge configs, no giant monorepo—b…
☆6,017 · Updated this week
SWE-bench / SWE-bench
SWE-bench: Can Language Models Resolve Real-world Github Issues?
☆5,483 · Apr 1, 2026 · Updated 3 months ago
open-thoughts / OpenThoughts-Agent
Data recipes and robust infrastructure for training AI agents
☆265 · Updated this week
aisa-group / PostTrainBench
Measuring how well CLI agents like Claude Code or Codex CLI can post-train base LLMs on a single H100 GPU in 10 hours
☆467 · Updated this week
datacurve-ai / pier
Pier is a Harbor fork built for DeepSWE, with stronger support for CLI agents in air-gapped (no-internet) tasks and more faithful, consis…
☆128 · Jul 12, 2026 · Updated 2 weeks ago
harbor-framework / harbor-datasets
☆36 · May 16, 2026 · Updated 2 months ago
sierra-research / tau2-bench
τ-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
☆1,662 · Updated this week
SWE-Gym / SWE-Gym
Code for Paper: Training Software Engineering Agents and Verifiers with SWE-Gym [ICML 2025]
☆711 · Jul 29, 2025 · Updated 11 months ago
NVIDIA-NeMo / ProRL-Agent-Server
Agentic RL on Any Harness at Scale
☆706 · Jul 15, 2026 · Updated last week
facebookresearch / ProgramBench
Can Language Models Rebuild Programs From Scratch?
☆860 · Jul 14, 2026 · Updated last week
abundant-ai / SWE-gen
Convert GitHub PRs into Harbor tasks
☆72 · Jul 13, 2026 · Updated last week
areal-project / AReaL
The RL Bridge for LLM-based Agent Applications. Made Simple & Flexible.
☆5,599 · Updated this week
harbor-framework / awesome-harbor
A curated list of awesome Harbor ecosystem projects
☆48 · May 29, 2026 · Updated last month
abundant-ai / swe-marathon
SWE-Marathon: an ultra long-horizon SWE benchmark
☆114 · Updated this week
huggingface / OpenEnv
An interface library for RL post training with environments.
☆2,450 · Updated this week
sgl-project / sglang
SGLang is a high-performance serving framework for large language models and multimodal models.
☆30,733 · Updated this week
Danau5tin / tbench-agentic-data-pipeline
Multi-agent synthetic data generation pipeline capable of generating and validating long horizon terminal/coding tasks for RL training
☆71 · Jul 28, 2025 · Updated 11 months ago
thinking-machines-lab / tinker-cookbook
Post-training with Tinker
☆3,911 · Updated this week
EleutherAI / lm-evaluation-harness
A framework for few-shot evaluation of language models.
☆13,407 · Jul 13, 2026 · Updated last week
Mercor-Intelligence / archipelago
Harness for running and evaluating AI agents against RL environments
☆224 · Updated this week
stanford-iris-lab / meta-harness-tbench2-artifact
Meta-Harness: 76.4% on Terminal-Bench 2.0 (Claude Opus 4.6)
☆1,150 · Mar 26, 2026 · Updated 3 months ago
OpenRLHF / OpenRLHF
An Easy-to-use, Scalable and High-performance Agentic RL Framework based on Ray (PPO & DAPO & REINFORCE++ & VLM & TIS & vLLM & Ray & Asy…
☆9,848 · Jul 14, 2026 · Updated last week
Danau5tin / terminal-bench-rl
GRPO training code which scales to 32xH100s for long horizon terminal/coding tasks. Base agent is now the top Qwen3 agent on Stanford's T…
☆399 · Aug 24, 2025 · Updated 11 months ago
axon-rl / gem
A Gym for Agentic LLMs
☆502 · Jan 21, 2026 · Updated 6 months ago