siegelz / core-bench
☆54Updated last week
Alternatives and similar repositories for core-bench
Users who are interested in core-bench are comparing it to the libraries listed below.
- ☆191Updated this week
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file.☆189Updated 8 months ago
- ☆119Updated last month
- A benchmark that challenges language models to code solutions for scientific problems☆154Updated last week
- Framework and toolkits for building and evaluating collaborative agents that can work together with humans.☆112Updated this week
- [ICLR'25] ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery☆111Updated 3 months ago
- Code for the paper 🌳 Tree Search for Language Model Agents☆216Updated last year
- Meta Agents Research Environments is a comprehensive platform designed to evaluate AI agents in dynamic, realistic scenarios. Unlike stat…☆377Updated 2 weeks ago
- Discovering Data-driven Hypotheses in the Wild☆119Updated 5 months ago
- TapeAgents is a framework that facilitates all stages of the LLM Agent development lifecycle☆299Updated last month
- AWM: Agent Workflow Memory☆365Updated 10 months ago
- [ICLR 2025] DSBench: How Far are Data Science Agents from Becoming Data Science Experts?☆86Updated 3 months ago
- Collection of evals for Inspect AI☆290Updated last week
- A Collection of Competitive Text-Based Games for Language Model Evaluation and Reinforcement Learning☆319Updated last month
- Evaluation of LLMs on latest math competitions☆197Updated last month
- Open source interpretability artefacts for R1.☆164Updated 7 months ago
- A virtual environment for developing and evaluating automated scientific discovery agents.☆191Updated 8 months ago
- LOFT: A 1 Million+ Token Long-Context Benchmark☆218Updated 5 months ago
- Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory☆209Updated 6 months ago
- SWE Arena☆35Updated 4 months ago
- Official repository for "Scaling Retrieval-Based Language Models with a Trillion-Token Datastore".☆220Updated last month
- Code for NeurIPS'24 paper 'Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization'☆234Updated 4 months ago
- Functional Benchmarks and the Reasoning Gap☆90Updated last year
- Reproducible, flexible LLM evaluations☆293Updated 2 weeks ago
- ☆300Updated 4 months ago
- ☆124Updated 9 months ago
- Source code for the collaborative reasoner research project at Meta FAIR.☆110Updated 7 months ago
- ☆92Updated last month
- ☆226Updated 9 months ago
- ☆87Updated this week