☆139 · Oct 16, 2025 · Updated 7 months ago
Alternatives and similar repositories for RE-Bench
Users who are interested in RE-Bench are comparing it to the repositories listed below.
- ☆34 · Jun 4, 2025 · Updated last year
- METR Task Standard ☆180 · Feb 3, 2025 · Updated last year
- ☆124 · Jan 19, 2026 · Updated 4 months ago
- ☆14 · Jul 12, 2024 · Updated last year
- The Automated LLM Speedrunning Benchmark measures how well LLM agents can reproduce previous innovations and discover new ones in languag… ☆143 · May 6, 2026 · Updated last month
- Work in progress! I don't recommend looking at the code right now. ☆24 · May 29, 2026 · Updated last week
- MetricEval: A framework that conceptualizes and operationalizes four main components of metric evaluation, in terms of reliability and va… ☆12 · Nov 6, 2023 · Updated 2 years ago
- ☆13 · Dec 8, 2022 · Updated 3 years ago
- A Python SDK for LLM finetuning and inference on Runpod infrastructure ☆30 · May 12, 2026 · Updated 3 weeks ago
- [ICLR'25] ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery ☆139 · Apr 29, 2026 · Updated last month
- Accompanying codebase for neuroscope.io, a website for displaying max activating dataset examples for language model neurons ☆14 · Feb 13, 2023 · Updated 3 years ago
- ☆49 · May 17, 2026 · Updated 3 weeks ago
- Keeping language models honest by directly eliciting knowledge encoded in their activations. ☆220 · Jun 1, 2026 · Updated last week
- Inspect: A framework for large language model evaluations ☆2,165 · Jun 4, 2026 · Updated last week
- Implementation of Direct Preference Optimization ☆17 · Jul 17, 2023 · Updated 2 years ago
- ☆342 · Jun 19, 2024 · Updated last year
- ☆1,120 · Updated this week
- Mamba support for transformer lens ☆20 · Sep 17, 2024 · Updated last year
- [NeurIPS '25] GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents ☆85 · Apr 27, 2026 · Updated last month
- Representation Engineering: A Top-Down Approach to AI Transparency ☆1,004 · Aug 14, 2024 · Updated last year
- Measuring the situational awareness of language models ☆41 · Feb 12, 2024 · Updated 2 years ago
- LLM benchmarks ☆13 · Feb 22, 2024 · Updated 2 years ago
- Measuring and Controlling Persona Drift in Language Model Dialogs ☆25 · Feb 26, 2024 · Updated 2 years ago
- ☆293 · Jun 2, 2026 · Updated last week
- ☆20 · Feb 17, 2023 · Updated 3 years ago
- ☆26 · Jun 22, 2025 · Updated 11 months ago
- Tools for running experiments on RL agents in procgen environments ☆20 · Apr 5, 2024 · Updated 2 years ago
- Measuring how well CLI agents like Claude Code or Codex CLI can post-train base LLMs on a single H100 GPU in 10 hours ☆350 · Jun 3, 2026 · Updated last week
- ☆419 · Aug 21, 2025 · Updated 9 months ago
- ☆25 · Apr 1, 2026 · Updated 2 months ago
- Collection of evals for Inspect AI ☆529 · Updated this week
- Code for my NeurIPS 2024 ATTRIB paper titled "Attribution Patching Outperforms Automated Circuit Discovery" ☆47 · May 31, 2024 · Updated 2 years ago
- Code repo for the paper: Attacking Vision-Language Computer Agents via Pop-ups ☆51 · Dec 23, 2024 · Updated last year
- Code for Voice Jailbreak Attacks Against GPT-4o. ☆38 · May 31, 2024 · Updated 2 years ago
- ☆12 · Aug 21, 2024 · Updated last year
- [ACL 2025] LongSafety: Evaluating Long-Context Safety of Large Language Models ☆16 · Jun 18, 2025 · Updated 11 months ago
- Benchmarking Goal-Oriented Software Engineering ☆165 · May 5, 2026 · Updated last month
- Inference API for many LLMs and other useful tools for empirical research ☆124 · May 29, 2026 · Updated last week
- MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering ☆1,566 · Apr 24, 2026 · Updated last month