allenai/discoverybench

Discovering Data-driven Hypotheses in the Wild

☆131

Alternatives and similar repositories for discoverybench

Users who are interested in discoverybench are comparing it to the libraries listed below.

OSU-NLP-Group / ScienceAgentBench
[ICLR'25] ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery
☆124 · Aug 26, 2025 · Updated 6 months ago
behavioral-data / BLADE
[EMNLP 2024 Findings] Benchmarking Language Model Agents for Data-Driven Science
☆34 · Oct 25, 2024 · Updated last year
allenai / discoveryworld
A virtual environment for developing and evaluating automated scientific discovery agents.
☆200 · Mar 10, 2025 · Updated 11 months ago
allenai / hci-alt-texts
Dataset and annotations for ASSETS 2022 publication
☆12 · Oct 6, 2022 · Updated 3 years ago
snap-stanford / BioDiscoveryAgent
BioDiscoveryAgent is an LLM-based AI agent for closed-loop design of genetic perturbation experiments
☆97 · Jul 6, 2025 · Updated 7 months ago
ZonglinY / MOOSE
[ACL 2024] <Large Language Models for Automated Open-domain Scientific Hypotheses Discovery>. It has also received the best poster award …
☆42 · Oct 28, 2024 · Updated last year
allenai / neurodiscoverybench
☆16 · Jan 29, 2026 · Updated last month
allenai / chime
Repository containing dataset, models and code associated with the CHIME project
☆17 · Aug 22, 2024 · Updated last year
snap-stanford / POPPER
Automated Hypothesis Testing with Agentic Sequential Falsifications
☆246 · May 14, 2025 · Updated 9 months ago
kylehamilton / JamoviMeta
Meta-Analysis for JAMOVI
☆11 · Nov 11, 2017 · Updated 8 years ago
allenai / ScienceWorld
ScienceWorld is a text-based virtual environment centered around accomplishing tasks from the standardized elementary science curriculum.
☆337 · Dec 3, 2025 · Updated 3 months ago
ZhaozhiQIAN / Single-Cause-Perturbation-NeurIPS-2021
Code for Estimating Multi-cause Treatment Effects via Single-cause Perturbation (NeurIPS 2021)
☆14 · Jan 5, 2022 · Updated 4 years ago
nealhaddaway / predicter
A Tool to Estimate the Time Needed to Conduct a Systematic Review or Systematic Map
☆16 · Jul 29, 2022 · Updated 3 years ago
GAIR-NLP / lm-open-science-evaluation
Reproducible and flexible LLM evaluations for scientific reasoning.
☆26 · Jul 23, 2025 · Updated 7 months ago
scicode-bench / SciCode
A benchmark that challenges language models to code solutions for scientific problems
☆176 · Updated this week
harvard-edge / dataperf-speech-example
Example workflow for our data-centric speech benchmark
☆17 · Jul 6, 2023 · Updated 2 years ago
allenai / super-benchmark
☆49 · Apr 4, 2025 · Updated 11 months ago
softnanolab / boileroom
Protein prediction models implemented with Modal
☆30 · Feb 22, 2026 · Updated last week
BatsResearch / ex2
If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions
☆17 · Apr 4, 2024 · Updated last year
qiancheng0 / EscapeBench
This is the repository for paper EscapeBench: Pushing Language Models to Think Outside the Box
☆18 · Dec 19, 2024 · Updated last year
WecoAI / weco-cli
The Platform for Self-Improving Code. Ideal for GPU kernels, ML model development, feature engineering, prompt engineering, and other opt…
☆30 · Updated this week
allenai / WildBench
Benchmarking LLMs with Challenging Tasks from Real Users
☆246 · Nov 3, 2024 · Updated last year
allenai / marg-reviewer
Code/data for MARG (multi-agent review generation)
☆59 · Sep 30, 2025 · Updated 5 months ago
Future-House / BixBench
Benchmark for LLM-based Agents in Computational Biology
☆72 · Oct 6, 2025 · Updated 4 months ago
Princeton-RL / CRTR
Official code for the paper "Contrastive Representations for Temporal Reasoning".
☆52 · Nov 25, 2025 · Updated 3 months ago
flbbb / locost-summarization
☆29 · Mar 22, 2024 · Updated last year
microsoft / SmartPlay
SmartPlay is a benchmark for Large Language Models (LLMs). Uses a variety of games to test various important LLM capabilities as agents. …
☆146 · Apr 11, 2024 · Updated last year
princeton-pli / hal-harness
☆229 · Updated this week
METR / RE-Bench
☆133 · Oct 16, 2025 · Updated 4 months ago
SalesforceAIResearch / swecomm
☆28 · Nov 10, 2025 · Updated 3 months ago
asaparov / prontoqa
Synthetic question-answering dataset to formally analyze the chain-of-thought output of large language models on a reasoning task.
☆155 · Sep 9, 2025 · Updated 5 months ago
snap-stanford / MLAgentBench
☆330 · Jun 19, 2024 · Updated last year
heaplax / ARMAP
☆27 · Jun 5, 2025 · Updated 8 months ago
gist-ailab / block-selection-for-OOD-detection
This is an official implementation for "Block Selection Method for Using Feature Norm in Out-of-distribution Detection", CVPR 2023.
☆24 · May 21, 2024 · Updated last year
Zhiyuan-Zeng / EvalTree
[COLM 2025] EvalTree: Profiling Language Model Weaknesses via Hierarchical Capability Trees
☆31 · Jul 11, 2025 · Updated 7 months ago
PAIR-code / pretraining-tda
☆32 · Feb 11, 2025 · Updated last year
OpenDFM / SciEval
[AAAI 2024] SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research
☆30 · Aug 6, 2024 · Updated last year
thomasnormal / fewshot
☆29 · Oct 24, 2025 · Updated 4 months ago
SynthLabsAI / big-math
A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models
☆72 · Feb 25, 2025 · Updated last year