Public repository containing METR's DVC pipeline for eval data analysis
☆224Feb 13, 2026Updated 3 weeks ago
Alternatives and similar repositories for eval-analysis-public
Users who are interested in eval-analysis-public are comparing it to the libraries listed below
- Vivaria is METR's tool for running evaluations and conducting agent elicitation research.☆134Feb 15, 2026Updated 2 weeks ago
- NeurIPS 2024: SciFIBench: Benchmarking Large Multimodal Models for Scientific Figure Interpretation☆13May 24, 2025Updated 9 months ago
- ☆23Jan 27, 2026Updated last month
- ☆11Mar 13, 2023Updated 2 years ago
- An Inspect extension for agentic cyber evaluations☆22Feb 24, 2026Updated last week
- Original code base for On Pretraining Data Diversity for Self-Supervised Learning☆14Dec 30, 2024Updated last year
- Experiments in applying interpretability techniques to learned reward functions.☆10Dec 11, 2020Updated 5 years ago
- Config files for setting up Multitenant Kubeflow on AWS with spot instances☆10Sep 15, 2020Updated 5 years ago
- A dashboard that shows the relationships between urban spaces and their networks of design, production and consumption with Maker initiati…☆13Nov 10, 2017Updated 8 years ago
- ☆65Feb 20, 2026Updated 2 weeks ago
- A private homebrew tap to install the Graphite CLI tools.☆14Feb 20, 2026Updated 2 weeks ago
- Vision Large Language Models trained on M3IT instruction tuning dataset☆17Aug 16, 2023Updated 2 years ago
- Python package to compute interaction indices that extend the Shapley Value. AISTATS 2023.☆19Sep 25, 2023Updated 2 years ago
- Byte-sized text games for code generation tasks on virtual environments☆20Jul 8, 2024Updated last year
- Official code for the paper: "Metadata Archaeology"☆19May 10, 2023Updated 2 years ago
- Python interface for the Quantum Exact Simulation Toolkit (QuEST)☆20Oct 29, 2025Updated 4 months ago
- METR Task Standard☆177Feb 3, 2025Updated last year
- Ranking LLMs on agentic tasks☆216Nov 18, 2025Updated 3 months ago
- A collection of examples leveraging the ndarray ecosystem.☆17Jan 6, 2020Updated 6 years ago
- Get exam-ready with scenario-based questions and detailed answers generated by AI for the AWS Solutions Architect certification.☆19Apr 2, 2023Updated 2 years ago
- Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024.☆116Jun 13, 2024Updated last year
- A formalisation of Cartesian Frames, a perspective on embedded agency, in the HOL theorem prover.☆20Dec 20, 2021Updated 4 years ago
- ☆45Feb 13, 2026Updated 3 weeks ago
- ☆25Jan 8, 2025Updated last year
- Training Proactive and Personalized LLM Agents☆102Jan 20, 2026Updated last month
- ☆87Jul 30, 2024Updated last year
- Code for paper "Point and Ask: Incorporating Pointing into Visual Question Answering"☆19Oct 4, 2022Updated 3 years ago
- Code for the Ask4Help project☆22Nov 24, 2022Updated 3 years ago
- Code for the paper: "No Zero-Shot Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance" [NeurI…☆94Apr 29, 2024Updated last year
- ☆22Sep 9, 2021Updated 4 years ago
- A python interface to the QuEST quantum simulator (cffi based)☆20Feb 2, 2024Updated 2 years ago
- This repo contains the dataset and code for the paper "SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software E…☆1,439Jul 18, 2025Updated 7 months ago
- Harness for running and evaluating AI agents against RL environments☆120Updated this week
- ☆50Oct 29, 2023Updated 2 years ago
- NEVIS'22: Benchmarking the next generation of never-ending learners☆102Dec 13, 2022Updated 3 years ago
- ☆237Updated this week
- ☆133Oct 16, 2025Updated 4 months ago
- Accompanying repo for CVPRW'24: Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs☆27May 24, 2025Updated 9 months ago
- A semantic search system for Airbnb listings in Stockholm, built with Superlinked and Qdrant. It leverages multi-attribute vector search …☆24Jul 1, 2025Updated 8 months ago