siegelz / core-bench
☆47 · Updated 6 months ago
Alternatives and similar repositories for core-bench
Users interested in core-bench are comparing it to the repositories listed below.
- ☆126 · Updated this week
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file. ☆182 · Updated 6 months ago
- ☆98 · Updated 4 months ago
- Discovering Data-driven Hypotheses in the Wild ☆110 · Updated 3 months ago
- A benchmark that challenges language models to code solutions for scientific problems ☆140 · Updated last week
- A Collection of Competitive Text-Based Games for Language Model Evaluation and Reinforcement Learning ☆272 · Updated this week
- Reproducible, flexible LLM evaluations ☆244 · Updated 2 months ago
- A virtual environment for developing and evaluating automated scientific discovery agents ☆182 · Updated 6 months ago
- [ICLR'25] ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery ☆100 · Updated 2 weeks ago
- Evaluation of LLMs on the latest math competitions ☆162 · Updated last month
- Collection of evals for Inspect AI ☆230 · Updated this week
- Open-source interpretability artefacts for R1 ☆158 · Updated 4 months ago
- Evaluating LLMs with fewer examples ☆161 · Updated last year
- ☆80 · Updated last week
- AWM: Agent Workflow Memory ☆316 · Updated 7 months ago
- [ICLR 2025] DSBench: How Far Are Data Science Agents from Becoming Data Science Experts? ☆72 · Updated 3 weeks ago
- Automatic evals for LLMs ☆524 · Updated 2 months ago
- ☆204 · Updated last year
- [EMNLP 2024] A Retrieval Benchmark for Scientific Literature Search ☆96 · Updated 9 months ago
- ☆304 · Updated last year
- ☆270 · Updated last month
- SWE Arena ☆34 · Updated 2 months ago
- A simple unified framework for evaluating LLMs ☆243 · Updated 5 months ago
- TapeAgents is a framework that facilitates all stages of the LLM agent development lifecycle ☆295 · Updated this week
- Functional Benchmarks and the Reasoning Gap ☆88 · Updated 11 months ago
- Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory ☆74 · Updated 3 months ago
- Framework and toolkits for building and evaluating collaborative agents that can work together with humans ☆97 · Updated 5 months ago
- (ACL 2025 Main) Code for MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents https://www.arxiv.org/pdf/2503.019… ☆157 · Updated this week
- Scaling Data for SWE-agents ☆399 · Updated this week
- Code repo for "Agent Instructs Large Language Models to be General Zero-Shot Reasoners" ☆116 · Updated last year