OSU-NLP-Group / ScienceAgentBench
[ICLR'25] ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery
☆53 · Updated last month
Alternatives and similar repositories for ScienceAgentBench:
Users who are interested in ScienceAgentBench are comparing it to the libraries listed below
- [ACL 2024] <Large Language Models for Automated Open-domain Scientific Hypotheses Discovery>. It has also received the best poster award … ☆38 · Updated 3 months ago
- [ACL'24] Code and data of paper "When is Tree Search Useful for LLM Planning? It Depends on the Discriminator" ☆54 · Updated 11 months ago
- Middleware for LLMs: Tools Are Instrumental for Language Agents in Complex Environments (EMNLP'2024) ☆35 · Updated last month
- [EMNLP 2024] A Retrieval Benchmark for Scientific Literature Search ☆69 · Updated 2 months ago
- DSBench: How Far are Data Science Agents from Becoming Data Science Experts? ☆43 · Updated this week
- Resources for our paper: "EvoAgent: Towards Automatic Multi-Agent Generation via Evolutionary Algorithms" ☆82 · Updated 4 months ago
- The code implementation of MAGDi: Structured Distillation of Multi-Agent Interaction Graphs Improves Reasoning in Smaller Language Models… ☆31 · Updated last year
- Code and Data for "Language Modeling with Editable External Knowledge" ☆31 · Updated 8 months ago
- Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data ☆35 · Updated this week
- ☆40 · Updated last week
- LLM for Scientific Research Survey ☆50 · Updated 3 weeks ago
- Structured Chemistry Reasoning with Large Language Models ☆32 · Updated 9 months ago
- Official implementation of the ACL 2024: Scientific Inspiration Machines Optimized for Novelty ☆74 · Updated 10 months ago
- Source code for our paper: "Put Your Money Where Your Mouth Is: Evaluating Strategic Planning and Execution of LLM Agents in an Auction A… ☆43 · Updated last year
- Code/data for MARG (multi-agent review generation) ☆38 · Updated 3 months ago
- A benchmark that challenges language models to code solutions for scientific problems ☆108 · Updated this week
- This repository contains ScholarQABench data and evaluation pipeline. ☆61 · Updated 2 weeks ago
- Scalable Meta-Evaluation of LLMs as Evaluators ☆43 · Updated last year
- The official repo for "TheoremQA: A Theorem-driven Question Answering dataset" (EMNLP 2023) ☆27 · Updated 9 months ago
- Flow of Reasoning: Training LLMs for Divergent Problem Solving with Minimal Examples ☆73 · Updated last month
- [ICLR 2025] Benchmarking Agentic Workflow Generation ☆47 · Updated this week
- Framework and toolkits for building and evaluating collaborative agents that can work together with humans. ☆52 · Updated this week
- The Open Source Code for LLM4SD (Large Language Models for Scientific Synthesis, Inference and Explanation) ☆32 · Updated last month
- Codebase accompanying the Summary of a Haystack paper. ☆74 · Updated 5 months ago
- Official implementation of paper "Process Reward Model with Q-value Rankings" ☆48 · Updated 2 weeks ago
- EMNLP 2024 "Re-reading improves reasoning in large language models". Simply repeating the question to get bidirectional understanding for… ☆24 · Updated 2 months ago
- Official Implementation of the Baby-AIGS system ☆22 · Updated 2 months ago