☆134 · Oct 16, 2025 · Updated 5 months ago
Alternatives and similar repositories for RE-Bench
Users that are interested in RE-Bench are comparing it to the libraries listed below.
- Vivaria is METR's tool for running evaluations and conducting agent elicitation research. ☆135 · Feb 15, 2026 · Updated last month
- ☆33 · Jun 4, 2025 · Updated 10 months ago
- METR Task Standard ☆179 · Feb 3, 2025 · Updated last year
- ☆120 · Jan 19, 2026 · Updated 2 months ago
- Production-Grade Autoresearch. Ideal for GPU kernels, ML model development, feature engineering, prompt engineering, and other optimizabl… ☆41 · Updated this week
- ☆13 · Jul 12, 2024 · Updated last year
- ☆23 · Oct 15, 2022 · Updated 3 years ago
- The Automated LLM Speedrunning Benchmark measures how well LLM agents can reproduce previous innovations and discover new ones in languag… ☆139 · Apr 3, 2026 · Updated last week
- Work in progress! I don't recommend looking at the code right now. ☆24 · Apr 1, 2026 · Updated last week
- MetricEval: A framework that conceptualizes and operationalizes four main components of metric evaluation, in terms of reliability and va… ☆12 · Nov 6, 2023 · Updated 2 years ago
- ☆13 · Dec 8, 2022 · Updated 3 years ago
- [ICLR'25] ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery ☆134 · Mar 5, 2026 · Updated last month
- Machine Learning for Alignment Bootcamp (MLAB). ☆33 · Jan 24, 2022 · Updated 4 years ago
- Accompanying codebase for neuroscope.io, a website for displaying max activating dataset examples for language model neurons ☆13 · Feb 13, 2023 · Updated 3 years ago
- Keeping language models honest by directly eliciting knowledge encoded in their activations. ☆217 · Updated this week
- Inspect: A framework for large language model evaluations ☆1,890 · Updated this week
- [NeurIPS '25] GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents ☆76 · Mar 16, 2026 · Updated 3 weeks ago
- ☆1,021 · Mar 29, 2026 · Updated last week
- ☆337 · Jun 19, 2024 · Updated last year
- Implementation of Direct Preference Optimization ☆17 · Jul 17, 2023 · Updated 2 years ago
- ☆15 · Dec 7, 2021 · Updated 4 years ago
- Mamba support for transformer lens ☆19 · Sep 17, 2024 · Updated last year
- Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024. ☆115 · Jun 13, 2024 · Updated last year
- Representation Engineering: A Top-Down Approach to AI Transparency ☆978 · Aug 14, 2024 · Updated last year
- LLM benchmarks ☆13 · Feb 22, 2024 · Updated 2 years ago
- Measuring how well CLI agents like Claude Code or Codex CLI can post-train base LLMs on a single H100 GPU in 10 hours ☆254 · Updated this week
- ☆256 · Updated this week
- Measuring and Controlling Persona Drift in Language Model Dialogs ☆23 · Feb 26, 2024 · Updated 2 years ago
- ☆20 · Feb 17, 2023 · Updated 3 years ago
- [CVPR'25] AIM-Fair: Advancing Algorithmic Fairness via Selectively Fine-Tuning Biased Models with Contextual Synthetic Data ☆17 · Mar 27, 2025 · Updated last year
- ☆24 · Jun 22, 2025 · Updated 9 months ago
- Collection of evals for Inspect AI ☆424 · Updated this week
- Official repo for the paper "Make Some Noise: Reliable and Efficient Single-Step Adversarial Training" (https://arxiv.org/abs/2202.01181) ☆25 · Oct 17, 2022 · Updated 3 years ago
- Tools for running experiments on RL agents in procgen environments ☆20 · Apr 5, 2024 · Updated 2 years ago
- Open sourced predictions, execution logs, trajectories, and results from model inference + evaluation runs on the SWE-bench task. ☆15 · Sep 4, 2024 · Updated last year
- Benchmarking Goal-Oriented Software Engineering ☆128 · Jan 7, 2026 · Updated 3 months ago
- ☆408 · Aug 21, 2025 · Updated 7 months ago
- ☆24 · Apr 1, 2026 · Updated last week
- Code for my NeurIPS 2024 ATTRIB paper titled "Attribution Patching Outperforms Automated Circuit Discovery" ☆47 · May 31, 2024 · Updated last year