☆136 · Oct 16, 2025 · Updated 6 months ago
Alternatives and similar repositories for RE-Bench
Users interested in RE-Bench are comparing it to the repositories listed below.
- Vivaria is METR's tool for running evaluations and conducting agent elicitation research. ☆136 · Feb 15, 2026 · Updated 2 months ago
- METR Task Standard ☆179 · Feb 3, 2025 · Updated last year
- ☆121 · Jan 19, 2026 · Updated 3 months ago
- ☆14 · Jul 12, 2024 · Updated last year
- ☆23 · Oct 15, 2022 · Updated 3 years ago
- The Automated LLM Speedrunning Benchmark measures how well LLM agents can reproduce previous innovations and discover new ones in languag… ☆140 · Apr 8, 2026 · Updated 3 weeks ago
- Work in progress! I don't recommend looking at the code right now. ☆23 · Updated this week
- MetricEval: A framework that conceptualizes and operationalizes four main components of metric evaluation, in terms of reliability and va… ☆12 · Nov 6, 2023 · Updated 2 years ago
- ☆13 · May 7, 2023 · Updated 2 years ago
- ☆13 · Dec 8, 2022 · Updated 3 years ago
- A Python SDK for LLM fine-tuning and inference on RunPod infrastructure ☆27 · Apr 23, 2026 · Updated last week
- [ICLR'25] ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery ☆136 · Updated this week
- Machine Learning for Alignment Bootcamp (MLAB). ☆33 · Jan 24, 2022 · Updated 4 years ago
- Keeping language models honest by directly eliciting knowledge encoded in their activations. ☆218 · Updated this week
- Inspect: A framework for large language model evaluations ☆1,974 · Updated this week
- A simplified Aragon OSx UI template for your custom DAO ☆10 · Jan 7, 2025 · Updated last year
- ☆338 · Jun 19, 2024 · Updated last year
- Implementation of Direct Preference Optimization ☆17 · Jul 17, 2023 · Updated 2 years ago
- ☆15 · Dec 7, 2021 · Updated 4 years ago
- Mamba support for TransformerLens ☆19 · Sep 17, 2024 · Updated last year
- Turn JIT-compiled JAX functions back into Python source code ☆23 · Dec 16, 2024 · Updated last year
- Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024. ☆115 · Jun 13, 2024 · Updated last year
- Representation Engineering: A Top-Down Approach to AI Transparency ☆989 · Aug 14, 2024 · Updated last year
- Measuring the situational awareness of language models ☆41 · Feb 12, 2024 · Updated 2 years ago
- LLM benchmarks ☆13 · Feb 22, 2024 · Updated 2 years ago
- ☆270 · Apr 21, 2026 · Updated last week
- Measuring and Controlling Persona Drift in Language Model Dialogs ☆24 · Feb 26, 2024 · Updated 2 years ago
- ☆20 · Feb 17, 2023 · Updated 3 years ago
- Measuring how well CLI agents like Claude Code or Codex CLI can post-train base LLMs on a single H100 GPU in 10 hours ☆293 · Updated this week
- [CVPR'25] AIM-Fair: Advancing Algorithmic Fairness via Selectively Fine-Tuning Biased Models with Contextual Synthetic Data ☆17 · Mar 27, 2025 · Updated last year
- ☆26 · Jun 22, 2025 · Updated 10 months ago
- Official repo for the paper "Make Some Noise: Reliable and Efficient Single-Step Adversarial Training" (https://arxiv.org/abs/2202.01181) ☆25 · Oct 17, 2022 · Updated 3 years ago
- Tools for running experiments on RL agents in procgen environments ☆20 · Apr 5, 2024 · Updated 2 years ago
- [ICML 2025] Official repository for the paper "OR-Bench: An Over-Refusal Benchmark for Large Language Models" ☆26 · Mar 4, 2025 · Updated last year
- Open-sourced predictions, execution logs, trajectories, and results from model inference + evaluation runs on the SWE-bench task. ☆15 · Sep 4, 2024 · Updated last year
- ☆412 · Aug 21, 2025 · Updated 8 months ago
- Collection of evals for Inspect AI ☆466 · Updated this week
- ☆25 · Apr 1, 2026 · Updated last month
- Code repo for the paper: Attacking Vision-Language Computer Agents via Pop-ups ☆51 · Dec 23, 2024 · Updated last year