☆133 Oct 16, 2025 Updated 4 months ago
Alternatives and similar repositories for RE-Bench
Users who are interested in RE-Bench are comparing it to the repositories listed below.
- Vivaria is METR's tool for running evaluations and conducting agent elicitation research. ☆134 Feb 15, 2026 Updated 2 weeks ago
- ☆33 Jun 4, 2025 Updated 8 months ago
- ☆119 Jan 19, 2026 Updated last month
- METR Task Standard ☆177 Feb 3, 2025 Updated last year
- The Platform for Self-Improving Code. Ideal for GPU kernels, ML model development, feature engineering, prompt engineering, and other opt… ☆30 Updated this week
- ☆12 Jul 12, 2024 Updated last year
- MetricEval: A framework that conceptualizes and operationalizes four main components of metric evaluation, in terms of reliability and va… ☆12 Nov 6, 2023 Updated 2 years ago
- Open sourced predictions, execution logs, trajectories, and results from model inference + evaluation runs on the SWE-bench task. ☆15 Sep 4, 2024 Updated last year
- ☆23 Oct 15, 2022 Updated 3 years ago
- Work in progress! I don't recommend looking at the code right now. ☆24 Dec 3, 2025 Updated 2 months ago
- Official repo for the paper "Make Some Noise: Reliable and Efficient Single-Step Adversarial Training" (https://arxiv.org/abs/2202.01181) ☆25 Oct 17, 2022 Updated 3 years ago
- ☆13 May 7, 2023 Updated 2 years ago
- LLM benchmarks ☆13 Feb 22, 2024 Updated 2 years ago
- ☆11 Apr 6, 2024 Updated last year
- ☆11 Jan 3, 2024 Updated 2 years ago
- incremental symbol learning for natural language understanding ☆10 Jun 12, 2023 Updated 2 years ago
- Shaping Language Models with Cognitive Insights ☆15 Feb 29, 2024 Updated 2 years ago
- Prompt-Guided Retrieval For Non-Knowledge-Intensive Tasks ☆12 Sep 1, 2023 Updated 2 years ago
- [ICML 2025] Official repository for paper "OR-Bench: An Over-Refusal Benchmark for Large Language Models" ☆23 Mar 4, 2025 Updated 11 months ago
- Accompanying codebase for neuroscope.io, a website for displaying max activating dataset examples for language model neurons ☆13 Feb 13, 2023 Updated 3 years ago
- Creative AI for Visual Art and Music slides and demos. ☆11 May 2, 2023 Updated 2 years ago
- Know2BIO: A Comprehensive Dual-View Benchmark for Evolving Biomedical Knowledge Graphs ☆14 Feb 10, 2026 Updated 2 weeks ago
- [LREC-Coling 2024] PECC: Problem Extraction and Coding Challenges ☆14 May 30, 2024 Updated last year
- The application is an end-user training and evaluation system for standard knowledge graph embedding models. It was developed to optimise … ☆18 May 30, 2025 Updated 9 months ago
- ☆330 Jun 19, 2024 Updated last year
- Representation Engineering: A Top-Down Approach to AI Transparency ☆953 Aug 14, 2024 Updated last year
- The Automated LLM Speedrunning Benchmark measures how well LLM agents can reproduce previous innovations and discover new ones in languag… ☆131 Feb 21, 2026 Updated last week
- ☆944 Updated this week
- Inspect: A framework for large language model evaluations ☆1,783 Updated this week
- ☆23 Jan 27, 2026 Updated last month
- ☆21 Jun 22, 2025 Updated 8 months ago
- [CVPR2019] Synthesizing Environment-Aware Activities via Activity Sketches ☆13 Oct 3, 2023 Updated 2 years ago
- ☆13 Dec 8, 2022 Updated 3 years ago
- ☆12 Aug 21, 2024 Updated last year
- A Python SDK for LLM finetuning and inference on RunPod infrastructure ☆19 Feb 16, 2026 Updated last week
- ☆396 Aug 21, 2025 Updated 6 months ago
- ☆14 Aug 29, 2023 Updated 2 years ago
- ☆19 Sep 16, 2025 Updated 5 months ago
- Scripts for pushing models to Hugging Face repos ☆15 Sep 11, 2025 Updated 5 months ago