siegelz / core-bench
☆36 · Updated 2 months ago
Alternatives and similar repositories for core-bench
Users interested in core-bench are comparing it to the libraries listed below.
- ☆74 · Updated this week
- [ICLR'25] ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery ☆85 · Updated 2 weeks ago
- Complex Function Calling Benchmark. ☆99 · Updated 3 months ago
- Codebase accompanying the Summary of a Haystack paper. ☆78 · Updated 7 months ago
- [EMNLP 2024] A Retrieval Benchmark for Scientific Literature Search ☆84 · Updated 5 months ago
- Large language models (LLMs) made easy: EasyLM is a one-stop solution for pre-training, finetuning, evaluating and serving LLMs in JAX/Flax. ☆72 · Updated 8 months ago
- SWE Arena ☆33 · Updated last month
- Doing simple retrieval from LLMs at various context lengths to measure accuracy ☆99 · Updated last year
- Source code for the collaborative reasoner research project at Meta FAIR. ☆40 · Updated 3 weeks ago
- Middleware for LLMs: Tools Are Instrumental for Language Agents in Complex Environments (EMNLP 2024) ☆36 · Updated 4 months ago
- Lean implementation of various multi-agent LLM methods, including Iteration of Thought (IoT) ☆110 · Updated 3 months ago
- Official Implementation of "Reasoning Language Models: A Blueprint" ☆59 · Updated 3 months ago
- Open source interpretability artefacts for R1. ☆109 · Updated 3 weeks ago
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file. ☆172 · Updated 2 months ago
- Functional Benchmarks and the Reasoning Gap ☆86 · Updated 7 months ago
- A small library of LLM judges ☆191 · Updated 2 weeks ago
- Dataset and evaluation suite enabling LLM instruction-following for scientific literature understanding. ☆40 · Updated last month
- The code for the paper ROUTERBENCH: A Benchmark for Multi-LLM Routing System ☆118 · Updated 11 months ago
- Scalable Meta-Evaluation of LLMs as Evaluators ☆42 · Updated last year
- A simple unified framework for evaluating LLMs ☆209 · Updated last month
- This repository contains the ScholarQABench data and evaluation pipeline. ☆71 · Updated last month
- Train your own SOTA deductive reasoning model ☆92 · Updated 2 months ago
- Verdict is a library for scaling judge-time compute. ☆209 · Updated 2 weeks ago
- Code repo for "Agent Instructs Large Language Models to be General Zero-Shot Reasoners" ☆108 · Updated 8 months ago
- ☆40 · Updated 9 months ago
- A benchmark that challenges language models to code solutions for scientific problems ☆119 · Updated this week
- ☆74 · Updated 2 weeks ago
- ☆43 · Updated 9 months ago
- A Collection of Competitive Text-Based Games for Language Model Evaluation and Reinforcement Learning ☆153 · Updated this week
- [ACL 2024] <Large Language Models for Automated Open-domain Scientific Hypotheses Discovery>. It has also received the best poster award … ☆40 · Updated 6 months ago