☆134 · Oct 16, 2025 · Updated 5 months ago
Alternatives and similar repositories for RE-Bench
Users who are interested in RE-Bench are comparing it to the libraries listed below.
- Vivaria is METR's tool for running evaluations and conducting agent elicitation research. ☆135 · Feb 15, 2026 · Updated last month
- METR Task Standard ☆178 · Feb 3, 2025 · Updated last year
- ☆121 · Jan 19, 2026 · Updated 2 months ago
- The Platform for Self-Improving Code. Ideal for GPU kernels, ML model development, feature engineering, prompt engineering, and other opt… ☆36 · Updated this week
- ☆23 · Oct 15, 2022 · Updated 3 years ago
- The Automated LLM Speedrunning Benchmark measures how well LLM agents can reproduce previous innovations and discover new ones in languag… ☆134 · Feb 21, 2026 · Updated last month
- Work in progress! I don't recommend looking at the code right now. ☆24 · Dec 3, 2025 · Updated 3 months ago
- MetricEval: A framework that conceptualizes and operationalizes four main components of metric evaluation, in terms of reliability and va… ☆12 · Nov 6, 2023 · Updated 2 years ago
- ☆13 · Dec 8, 2022 · Updated 3 years ago
- [ICLR'25] ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery ☆132 · Mar 5, 2026 · Updated 2 weeks ago
- Machine Learning for Alignment Bootcamp (MLAB). ☆32 · Jan 24, 2022 · Updated 4 years ago
- Accompanying codebase for neuroscope.io, a website for displaying max activating dataset examples for language model neurons ☆13 · Feb 13, 2023 · Updated 3 years ago
- ☆48 · Feb 13, 2026 · Updated last month
- Keeping language models honest by directly eliciting knowledge encoded in their activations. ☆217 · Updated this week
- Inspect: A framework for large language model evaluations ☆1,841 · Updated this week
- An Aragon OSx simplified UI template for your custom DAO ☆10 · Jan 7, 2025 · Updated last year
- ☆1,000 · Mar 14, 2026 · Updated last week
- [NeurIPS '25] GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents ☆74 · Updated this week
- ☆331 · Jun 19, 2024 · Updated last year
- Implementation of Direct Preference Optimization ☆17 · Jul 17, 2023 · Updated 2 years ago
- ☆15 · Dec 7, 2021 · Updated 4 years ago
- ☆23 · Jun 22, 2025 · Updated 9 months ago
- Mamba support for transformer lens ☆19 · Sep 17, 2024 · Updated last year
- Measuring how well CLI agents like Claude Code or Codex CLI can post-train base LLMs on a single H100 GPU in 10 hours ☆229 · Mar 10, 2026 · Updated last week
- Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024. ☆116 · Jun 13, 2024 · Updated last year
- Representation Engineering: A Top-Down Approach to AI Transparency ☆965 · Aug 14, 2024 · Updated last year
- Measuring the situational awareness of language models ☆40 · Feb 12, 2024 · Updated 2 years ago
- LLM benchmarks ☆13 · Feb 22, 2024 · Updated 2 years ago
- ☆247 · Updated this week
- ☆20 · Feb 17, 2023 · Updated 3 years ago
- [CVPR'25] AIM-Fair: Advancing Algorithmic Fairness via Selectively Fine-Tuning Biased Models with Contextual Synthetic Data ☆17 · Mar 27, 2025 · Updated 11 months ago
- Tools for running experiments on RL agents in procgen environments ☆20 · Apr 5, 2024 · Updated last year
- [ICML 2025] Official repository for paper "OR-Bench: An Over-Refusal Benchmark for Large Language Models" ☆25 · Mar 4, 2025 · Updated last year
- Open sourced predictions, execution logs, trajectories, and results from model inference + evaluation runs on the SWE-bench task. ☆15 · Sep 4, 2024 · Updated last year
- ☆403 · Aug 21, 2025 · Updated 7 months ago
- Benchmarking Goal-Oriented Software Engineering ☆122 · Jan 7, 2026 · Updated 2 months ago
- ☆24 · Jan 27, 2026 · Updated last month
- Code for my NeurIPS 2024 ATTRIB paper titled "Attribution Patching Outperforms Automated Circuit Discovery" ☆47 · May 31, 2024 · Updated last year
- Code repo for the paper: Attacking Vision-Language Computer Agents via Pop-ups ☆51 · Dec 23, 2024 · Updated last year