harbor-framework/terminal-bench

A benchmark for LLMs on complicated tasks in the terminal

☆2,483

Alternatives and similar repositories for terminal-bench

Users who are interested in terminal-bench are comparing it to the repositories listed below.

harbor-framework / harbor
Framework for evaluating and improving agents
☆3,504 · Updated this week
harbor-framework / terminal-bench-2
☆344 · Apr 30, 2026 · Updated 2 months ago
harbor-framework / frontier-bench
Measuring and evolving with the frontier of agent work
☆387 · Updated this week
SWE-bench / SWE-bench
SWE-bench: Can Language Models Resolve Real-world Github Issues?
☆5,483 · Apr 1, 2026 · Updated 3 months ago
SWE-bench / SWE-smith
[NeurIPS 2025 D&B Spotlight] Scaling Data for SWE-agents
☆711 · Updated this week
NovaSky-AI / SkyRL
SkyRL: A Modular Full-stack RL Library for LLMs
☆2,093 · Updated this week
Danau5tin / terminal-bench-rl
GRPO training code which scales to 32xH100s for long horizon terminal/coding tasks. Base agent is now the top Qwen3 agent on Stanford's T…
☆399 · Aug 24, 2025 · Updated 11 months ago
SWE-Gym / SWE-Gym
Code for Paper: Training Software Engineering Agents and Verifiers with SWE-Gym [ICML 2025]
☆711 · Jul 29, 2025 · Updated 11 months ago
sierra-research / tau2-bench
τ-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
☆1,662 · Updated this week
THUDM / slime
slime is an LLM post-training framework for RL Scaling.
☆7,629 · Updated this week
hkust-nlp / Toolathlon
[ICLR 2026] The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution
☆440 · Updated this week
SWE-agent / mini-swe-agent
The 100 line AI agent that solves GitHub issues or helps you in your command line. Radically simple, no huge configs, no giant monorepo—b…
☆6,017 · Updated this week
R2E-Gym / R2E-Gym
[COLM 2025] Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents
☆310 · Jul 13, 2025 · Updated last year
verl-project / verl
verl/HybridFlow: A Flexible and Efficient RL Post-Training Framework
☆22,654 · Updated this week
scaleapi / SWE-bench_Pro-os
SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
☆487 · May 18, 2026 · Updated 2 months ago
sierra-research / tau-bench
Code and Data for Tau-Bench
☆1,345 · Mar 18, 2026 · Updated 4 months ago
aisa-group / PostTrainBench
Measuring how well CLI agents like Claude Code or Codex CLI can post-train base LLMs on a single H100 GPU in 10 hours
☆467 · Updated this week
PrimeIntellect-ai / verifiers
Our library for RL environments + evals
☆4,400 · Updated this week
benchflow-ai / skillsbench
SkillsBench evaluates how well skills work and how effective agents are at using them.
☆1,577 · Updated this week
rllm-org / rllm
Democratizing Reinforcement Learning for LLMs
☆5,731 · Updated this week
Danau5tin / tbench-agentic-data-pipeline
Multi-agent synthetic data generation pipeline capable of generating and validating long horizon terminal/coding tasks for RL training
☆71 · Jul 28, 2025 · Updated 11 months ago
open-thoughts / OpenThoughts-Agent
Data recipes and robust infrastructure for training AI agents
☆265 · Updated this week
xlang-ai / OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
☆3,036 · Updated this week
areal-project / AReaL
The RL Bridge for LLM-based Agent Applications. Made Simple & Flexible.
☆5,599 · Updated this week
harbor-framework / terminal-bench-science
Terminal-Bench Science: Evaluating AI Agents on Complex Real-World Scientific Workflows in the Terminal
☆214 · Updated this week
SWE-agent / SWE-ReX
Sandboxed code execution for AI agents, locally or on the cloud. Massively parallel, easy to extend. Powering SWE-agent and more.
☆555 · Updated this week
harbor-framework / terminal-bench-challenges
☆19 · Jun 18, 2026 · Updated last month
LiveCodeBench / LiveCodeBench
Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code"
☆913 · Jul 16, 2025 · Updated last year
facebookresearch / ProgramBench
Can Language Models Rebuild Programs From Scratch?
☆860 · Jul 14, 2026 · Updated last week
sgl-project / sglang
SGLang is a high-performance serving framework for large language models and multimodal models.
☆30,733 · Updated this week
thinking-machines-lab / tinker-cookbook
Post-training with Tinker
☆3,911 · Updated this week
EleutherAI / lm-evaluation-harness
A framework for few-shot evaluation of language models.
☆13,407 · Jul 13, 2026 · Updated last week
microsoft / SWE-bench-Live
[NeurIPS 2025 D&B] 🚀 SWE-bench Goes Live!
☆212 · Jun 11, 2026 · Updated last month
camel-ai / seta-env
💻 SETA: Scaling Environments for Terminal Agents - Environments
☆143 · Feb 16, 2026 · Updated 5 months ago
openai / mle-bench
MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering
☆1,655 · Apr 24, 2026 · Updated 3 months ago
stanford-iris-lab / meta-harness-tbench2-artifact
Meta-Harness: 76.4% on Terminal-Bench 2.0 (Claude Opus 4.6)
☆1,150 · Mar 26, 2026 · Updated 4 months ago
vllm-project / vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
☆87,138 · Updated this week
openai / frontier-evals
OpenAI Frontier Evals
☆1,262 · Apr 21, 2026 · Updated 3 months ago
OpenRLHF / OpenRLHF
An Easy-to-use, Scalable and High-performance Agentic RL Framework based on Ray (PPO & DAPO & REINFORCE++ & VLM & TIS & vLLM & Ray & Asy…
☆9,848 · Jul 14, 2026 · Updated last week