Public repository containing METR's DVC pipeline for eval data analysis
☆243 · Mar 6, 2026 · Updated 3 weeks ago
Alternatives and similar repositories for eval-analysis-public
Users that are interested in eval-analysis-public are comparing it to the libraries listed below.
- An Inspect extension for agentic cyber evaluations ☆24 · Feb 24, 2026 · Updated last month
- Vivaria is METR's tool for running evaluations and conducting agent elicitation research. ☆135 · Feb 15, 2026 · Updated last month
- NeurIPS 2024: SciFIBench: Benchmarking Large Multimodal Models for Scientific Figure Interpretation ☆13 · May 24, 2025 · Updated 10 months ago
- ☆14 · Mar 20, 2026 · Updated last week
- ☆24 · Jan 27, 2026 · Updated 2 months ago
- ☆22 · May 25, 2024 · Updated last year
- ☆48 · Mar 19, 2026 · Updated last week
- Optimally-weighted herding is Bayesian Quadrature ☆16 · Jul 8, 2016 · Updated 9 years ago
- A Kubernetes sandbox environment for use with inspect_ai ☆28 · Mar 19, 2026 · Updated last week
- [ICML 2025] EffiCoder: Enhancing Code Generation in Large Language Models through Efficiency-Aware Fine-tuning ☆16 · May 24, 2025 · Updated 10 months ago
- ☆33 · Jun 4, 2025 · Updated 9 months ago
- Code, Data and Red Teaming for ZeroBench ☆59 · Dec 23, 2025 · Updated 3 months ago
- ☆12 · Feb 11, 2026 · Updated last month
- ☆66 · Feb 20, 2026 · Updated last month
- phylogenomic analysis of restriction sites & reverse genetic systems ☆13 · Oct 18, 2022 · Updated 3 years ago
- UQ: Assessing Language Models on Unsolved Questions ☆30 · Aug 26, 2025 · Updated 7 months ago
- Accompanying repo for CVPRW'24: Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs ☆27 · May 24, 2025 · Updated 10 months ago
- Python package to download and use the SSB datasets ☆11 · Aug 3, 2023 · Updated 2 years ago
- Repository to create traveling waves integrate special information through time ☆56 · Aug 8, 2025 · Updated 7 months ago
- Indexing framework designed for the automated creation of structured knowledge bases in Azure AI Search ☆14 · Jun 18, 2025 · Updated 9 months ago
- Python package to compute interaction indices that extend the Shapley Value. AISTATS 2023. ☆19 · Sep 25, 2023 · Updated 2 years ago
- METR Task Standard ☆178 · Feb 3, 2025 · Updated last year
- Inspect: A framework for large language model evaluations ☆1,851 · Updated this week
- Reward Learning by Simulating the Past ☆46 · May 9, 2019 · Updated 6 years ago
- Accompanying repo for NeurIPSW'23: GPT4GEO: How a Language Model Sees the World's Geography ☆27 · May 24, 2025 · Updated 10 months ago
- Vision Large Language Models trained on M3IT instruction tuning dataset ☆17 · Aug 16, 2023 · Updated 2 years ago
- Youtube Too Long Didn't Watch ☆13 · Sep 2, 2024 · Updated last year
- The official implementation of the paper "Self-Updatable Large Language Models by Integrating Context into Model Parameters" ☆15 · May 18, 2025 · Updated 10 months ago
- Official code for the paper: "Metadata Archaeology" ☆19 · May 10, 2023 · Updated 2 years ago
- Byte-sized text games for code generation tasks on virtual environments ☆20 · Jul 8, 2024 · Updated last year
- ☆14 · Oct 30, 2024 · Updated last year
- Library for text-to-text regression, applicable to any input string representation and allows pretraining and fine-tuning over multiple r… ☆327 · Mar 8, 2026 · Updated 2 weeks ago
- A blog on AI, personal development, and living a good life. ☆36 · Updated this week
- Experiments in applying interpretability techniques to learned reward functions. ☆10 · Dec 11, 2020 · Updated 5 years ago
- Code for the paper "Refining Language Model with Compositional Explanation" (NeurIPS 2021) ☆11 · Oct 25, 2021 · Updated 4 years ago
- ☆18 · Mar 13, 2026 · Updated 2 weeks ago
- An R package for simulating line lists ☆10 · Mar 14, 2026 · Updated last week
- ☆10 · Feb 9, 2026 · Updated last month
- This repo contains the dataset and code for the paper "SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software E… ☆1,439 · Jul 18, 2025 · Updated 8 months ago