☆136 Oct 16, 2025 Updated 7 months ago
Alternatives and similar repositories for RE-Bench
Users that are interested in RE-Bench are comparing it to the libraries listed below.
- Vivaria is METR's tool for running evaluations and conducting agent elicitation research. ☆137 Feb 15, 2026 Updated 3 months ago
- METR Task Standard ☆180 Feb 3, 2025 Updated last year
- ☆123 Jan 19, 2026 Updated 4 months ago
- ☆14 Jul 12, 2024 Updated last year
- ☆74 Nov 23, 2025 Updated 5 months ago
- ☆23 Oct 15, 2022 Updated 3 years ago
- The Automated LLM Speedrunning Benchmark measures how well LLM agents can reproduce previous innovations and discover new ones in languag… ☆142 May 6, 2026 Updated 2 weeks ago
- Work in progress! I don't recommend looking at the code right now. ☆24 May 9, 2026 Updated last week
- MetricEval: A framework that conceptualizes and operationalizes four main components of metric evaluation, in terms of reliability and va… ☆12 Nov 6, 2023 Updated 2 years ago
- ☆13 May 7, 2023 Updated 3 years ago
- ☆13 Dec 8, 2022 Updated 3 years ago
- A python sdk for LLM finetuning and inference on runpod infrastructure ☆30 May 12, 2026 Updated last week
- [ICLR'25] ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery ☆137 Apr 29, 2026 Updated 3 weeks ago
- Machine Learning for Alignment Bootcamp (MLAB). ☆33 Jan 24, 2022 Updated 4 years ago
- Accompanying codebase for neuroscope.io, a website for displaying max activating dataset examples for language model neurons ☆13 Feb 13, 2023 Updated 3 years ago
- Keeping language models honest by directly eliciting knowledge encoded in their activations. ☆219 Updated this week
- ☆49 Updated this week
- Inspect: A framework for large language model evaluations ☆2,096 Updated this week
- ☆342 Jun 19, 2024 Updated last year
- ☆1,082 May 12, 2026 Updated last week
- ☆15 Dec 7, 2021 Updated 4 years ago
- Mamba support for transformer lens ☆20 Sep 17, 2024 Updated last year
- [NeurIPS '25] GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents ☆82 Apr 27, 2026 Updated 3 weeks ago
- Turn jitted jax functions back into python source code ☆23 Dec 16, 2024 Updated last year
- Representation Engineering: A Top-Down Approach to AI Transparency ☆994 Aug 14, 2024 Updated last year
- Measuring the situational awareness of language models ☆41 Feb 12, 2024 Updated 2 years ago
- LLM benchmarks ☆13 Feb 22, 2024 Updated 2 years ago
- Measuring and Controlling Persona Drift in Language Model Dialogs ☆25 Feb 26, 2024 Updated 2 years ago
- ☆285 Updated this week
- ☆20 Feb 17, 2023 Updated 3 years ago
- [CVPR'25] AIM-Fair: Advancing Algorithmic Fairness via Selectively Fine-Tuning Biased Models with Contextual Synthetic Data ☆17 Mar 27, 2025 Updated last year
- ☆26 Jun 22, 2025 Updated 11 months ago
- Official repo for the paper "Make Some Noise: Reliable and Efficient Single-Step Adversarial Training" (https://arxiv.org/abs/2202.01181) ☆25 Oct 17, 2022 Updated 3 years ago
- Tools for running experiments on RL agents in procgen environments ☆20 Apr 5, 2024 Updated 2 years ago
- [ICML 2025] Official repository for paper "OR-Bench: An Over-Refusal Benchmark for Large Language Models" ☆26 Mar 4, 2025 Updated last year
- Open sourced predictions, execution logs, trajectories, and results from model inference + evaluation runs on the SWE-bench task. ☆15 Sep 4, 2024 Updated last year
- Measuring how well CLI agents like Claude Code or Codex CLI can post-train base LLMs on a single H100 GPU in 10 hours ☆329 Updated this week
- ☆416 Aug 21, 2025 Updated 9 months ago
- Collection of evals for Inspect AI ☆498 Updated this week