METR/RE-Bench

☆136

Alternatives and similar repositories for RE-Bench

Users who are interested in RE-Bench are comparing it to the libraries listed below.

METR / vivaria
Vivaria is METR's tool for running evaluations and conducting agent elicitation research.
☆137 • Feb 15, 2026 • Updated 3 months ago
METR / task-standard
METR Task Standard
☆180 • Feb 3, 2025 • Updated last year
METR / public-tasks
☆123 • Jan 19, 2026 • Updated 4 months ago
Cadenza-Labs / sleeper-agents
☆14 • Jul 12, 2024 • Updated last year
siegelz / core-bench
☆74 • Nov 23, 2025 • Updated 5 months ago
violet-zct / swarm-distillation-zero-shot
☆23 • Oct 15, 2022 • Updated 3 years ago
facebookresearch / llm-speedrunner
The Automated LLM Speedrunning Benchmark measures how well LLM agents can reproduce previous innovations and discover new ones in languag…
☆142 • May 6, 2026 • Updated 2 weeks ago
epoch-research / ftm
Work in progress! I don't recommend looking at the code right now.
☆24 • May 9, 2026 • Updated last week
isle-dev / MetricEval
MetricEval: A framework that conceptualizes and operationalizes four main components of metric evaluation, in terms of reliability and va…
☆12 • Nov 6, 2023 • Updated 2 years ago
MichaelEinhorn / trl-textworld
☆13 • May 7, 2023 • Updated 3 years ago
nilesc / Long-Structured-Debate-Generation-and-Evaluation
☆13 • Dec 8, 2022 • Updated 3 years ago
longtermrisk / openweights
A python sdk for LLM finetuning and inference on runpod infrastructure
☆30 • May 12, 2026 • Updated last week
OSU-NLP-Group / ScienceAgentBench
[ICLR'25] ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery
☆137 • Apr 29, 2026 • Updated 3 weeks ago
danielmamay / mlab
Machine Learning for Alignment Bootcamp (MLAB).
☆33 • Jan 24, 2022 • Updated 4 years ago
neelnanda-io / Neuroscope
Accompanying codebase for neuroscope.io, a website for displaying max activating dataset examples for language model neurons
☆13 • Feb 13, 2023 • Updated 3 years ago
EleutherAI / elk
Keeping language models honest by directly eliciting knowledge encoded in their activations.
☆219 • Updated this week
UKGovernmentBEIS / hibayes
☆49 • Updated this week
UKGovernmentBEIS / inspect_ai
Inspect: A framework for large language model evaluations
☆2,096 • Updated this week
snap-stanford / MLAgentBench
☆342 • Jun 19, 2024 • Updated last year
callummcdougall / ARENA_3.0
☆1,082 • May 12, 2026 • Updated last week
locuslab / intermediate_robustness
☆15 • Dec 7, 2021 • Updated 4 years ago
Phylliida / MambaLens
Mamba support for transformer lens
☆20 • Sep 17, 2024 • Updated last year
gso-bench / gso
[NeurIPS '25] GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents
☆82 • Apr 27, 2026 • Updated 3 weeks ago
dlwh / jax_sourceror
Turn jitted jax functions back into python source code
☆23 • Dec 16, 2024 • Updated last year
andyzoujm / representation-engineering
Representation Engineering: A Top-Down Approach to AI Transparency
☆994 • Aug 14, 2024 • Updated last year
AsaCooperStickland / situational-awareness-evals
Measuring the situational awareness of language models
☆41 • Feb 12, 2024 • Updated 2 years ago
lunary-ai / llm-benchmarks
LLM benchmarks
☆13 • Feb 22, 2024 • Updated 2 years ago
likenneth / persona_drift
Measuring and Controlling Persona Drift in Language Model Dialogs
☆25 • Feb 26, 2024 • Updated 2 years ago
princeton-pli / hal-harness
☆285 • Updated this week
redwoodresearch / remix_public
☆20 • Feb 17, 2023 • Updated 3 years ago
zengqunzhao / AIM-Fair
[CVPR'25] AIM-Fair: Advancing Algorithmic Fairness via Selectively Fine-Tuning Biased Models with Contextual Synthetic Data
☆17 • Mar 27, 2025 • Updated last year
safety-research / SHADE-Arena
☆26 • Jun 22, 2025 • Updated 11 months ago
pdejorge / N-FGSM
Official repo for the paper "Make Some Noise: Reliable and Efficient Single-Step Adversarial Training" (https://arxiv.org/abs/2202.01181)
☆25 • Oct 17, 2022 • Updated 3 years ago
UlisseMini / procgen-tools
Tools for running experiments on RL agents in procgen environments
☆20 • Apr 5, 2024 • Updated 2 years ago
justincui03 / or-bench
[ICML 2025] Official repository for paper "OR-Bench: An Over-Refusal Benchmark for Large Language Models"
☆26 • Mar 4, 2025 • Updated last year
CosineAI / experiments
Open sourced predictions, execution logs, trajectories, and results from model inference + evaluation runs on the SWE-bench task.
☆15 • Sep 4, 2024 • Updated last year
aisa-group / PostTrainBench
Measuring how well CLI agents like Claude Code or Codex CLI can post-train base LLMs on a single H100 GPU in 10 hours
☆329 • Updated this week
saprmarks / dictionary_learning
☆416 • Aug 21, 2025 • Updated 9 months ago
UKGovernmentBEIS / inspect_evals
Collection of evals for Inspect AI
☆498 • Updated this week