METR/eval-analysis-public

METR / eval-analysis-public

Public repository containing METR's DVC pipeline for eval data analysis

☆303

Alternatives and similar repositories for eval-analysis-public

Users who are interested in eval-analysis-public compare it to the repositories listed below.

METR / vivaria
Vivaria is METR's tool for running evaluations and conducting agent elicitation research.
☆140 · May 18, 2026 · Updated 2 months ago
METR / hcast-public
☆22 · Jul 6, 2026 · Updated 2 weeks ago
METR / RE-Bench
☆145 · Oct 16, 2025 · Updated 9 months ago
uvafan / timelines-takeoff-ai-2027
☆18 · Dec 10, 2025 · Updated 7 months ago
METR / public-tasks
☆129 · Jun 10, 2026 · Updated last month
METR / inspect-action
Running UK AISI's Inspect in the Cloud
☆24 · May 6, 2026 · Updated 2 months ago
google-deepmind / serial_depth
☆17 · Mar 10, 2026 · Updated 4 months ago
poking-agents / modular-public
☆34 · Jun 4, 2025 · Updated last year
meridianlabs-ai / inspect_viz
Data visualization for Inspect AI large language model evaluations.
☆21 · Jul 15, 2026 · Updated last week
filipgdorm / eco-llm
☆14 · Mar 20, 2026 · Updated 4 months ago
UKGovernmentBEIS / control-arena
ControlArena is a collection of settings, model organisms, and protocols for running control experiments.
☆213 · Updated this week
nishadsinghi / sc-genrm-scaling
[COLM 2025] Official code for "When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoni…
☆15 · Oct 31, 2025 · Updated 8 months ago
METR / task-standard
METR Task Standard
☆184 · Feb 3, 2025 · Updated last year
epoch-research / training-cost-trends
☆27 · Apr 1, 2026 · Updated 3 months ago
UKGovernmentBEIS / inspect_ai
Inspect: A framework for large language model evaluations
☆2,406 · Updated this week
UKGovernmentBEIS / inspect_evals
Collection of evals for Inspect AI
☆602 · Updated this week
meridianlabs-ai / inspect_petri
An alignment auditing agent capable of quickly exploring alignment hypotheses
☆1,273 · Updated this week
sambowyer / bayes_evals
A lightweight library for Bayesian analysis of LLM evals (ICML 2025 Spotlight Position Paper)
☆25 · May 28, 2025 · Updated last year
google-deepmind / dangerous-capability-evaluations
☆73 · Jun 16, 2026 · Updated last month
UKGovernmentBEIS / as-evaluation-standard
A repository that holds templates, examples, and tests to help external parties submit tasks to AISI that conform with the Autonomous Sys…
☆11 · Jan 16, 2026 · Updated 6 months ago
UKGovernmentBEIS / hibayes
☆53 · May 17, 2026 · Updated 2 months ago
UKGovernmentBEIS / sandbox_escape_bench
☆30 · Jul 7, 2026 · Updated 2 weeks ago
neilrathi / token-filtering
Shaping capabilities with token-level pretraining data filtering
☆95 · Jan 28, 2026 · Updated 5 months ago
EleutherAI / attribute
☆16 · Nov 14, 2025 · Updated 8 months ago
centerforaisafety / rli_evaluation_platform
Public repository for the Remote Labor Index (RLI)
☆75 · Nov 3, 2025 · Updated 8 months ago
CritPt-Benchmark / CritPt
☆84 · Nov 21, 2025 · Updated 8 months ago
UKGovernmentBEIS / aisi-sandboxing
The open-source AISI toolkit for sandboxing agentic evaluations
☆26 · Aug 7, 2025 · Updated 11 months ago
HugoFry / mats_sae_training_for_ViTs
☆25 · Apr 23, 2024 · Updated 2 years ago
alexander-turner / attainable-utility-preservation
☆11 · Jun 2, 2021 · Updated 5 years ago
alexisfox7 / PRO-LONG
☆135 · Updated this week
joyheyueya / giants
☆28 · Jun 1, 2026 · Updated last month
apartresearch / DarkBench
Benchmarking Dark Patterns in LLMs (ICLR 2025)
☆18 · Mar 29, 2025 · Updated last year
princeton-pli / hal-harness
☆310 · Jul 1, 2026 · Updated 3 weeks ago
TransluceAI / observatory
A toolkit for describing model features and intervening on those features to steer behavior.
☆249 · Mar 16, 2026 · Updated 4 months ago
rgreenblatt / model_organism_public
☆15 · Jun 17, 2025 · Updated last year
meridianlabs-ai / inspect_scout
In-depth analysis of AI agent transcripts.
☆57 · Updated this week
anadim / llm-benchmark-matrix
Cited 83-model x 49-benchmark LLM evaluation matrix with 18 matrix completion methods
☆39 · Feb 25, 2026 · Updated 5 months ago
safety-research / safety-tooling
Inference API for many LLMs and other useful tools for empirical research
☆134 · May 29, 2026 · Updated last month
arcprize / ARC-AGI-Community-Leaderboard
☆29 · Jun 11, 2026 · Updated last month