siegelz / core-bench
☆41Updated 4 months ago
Alternatives and similar repositories for core-bench
Users who are interested in core-bench are comparing it to the repositories listed below.
- ☆97Updated 2 weeks ago
- [ICLR'25] ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery☆91Updated last month
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file.☆173Updated 4 months ago
- ☆92Updated 2 months ago
- Official Code Repository for the paper "Distilling LLM Agent into Small Models with Retrieval and Code Tools"☆115Updated last month
- Source code for the collaborative reasoner research project at Meta FAIR.☆95Updated 3 months ago
- ☆185Updated 11 months ago
- A benchmark that challenges language models to code solutions for scientific problems☆127Updated last week
- A virtual environment for developing and evaluating automated scientific discovery agents.☆163Updated 4 months ago
- Discovering Data-driven Hypotheses in the Wild☆99Updated last month
- DSBench: How Far are Data Science Agents from Becoming Data Science Experts?☆58Updated 4 months ago
- Codebase accompanying the Summary of a Haystack paper.☆79Updated 9 months ago
- A Collection of Competitive Text-Based Games for Language Model Evaluation and Reinforcement Learning☆210Updated this week
- Functional Benchmarks and the Reasoning Gap☆88Updated 9 months ago
- Train your own SOTA deductive reasoning model☆99Updated 4 months ago
- Attribute (or cite) statements generated by LLMs back to in-context information.☆245Updated 9 months ago
- Open source interpretability artefacts for R1.☆154Updated 2 months ago
- A small library of LLM judges☆232Updated 3 weeks ago
- ☆94Updated last month
- Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation☆44Updated last year
- Code repo for "Agent Instructs Large Language Models to be General Zero-Shot Reasoners"☆113Updated 10 months ago
- Evaluation of LLMs on latest math competitions☆142Updated 2 months ago
- A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings.☆163Updated this week
- [EMNLP 2024] A Retrieval Benchmark for Scientific Literature Search☆90Updated 7 months ago
- SWE Arena☆34Updated last week
- Lean implementation of various multi-agent LLM methods, including Iteration of Thought (IoT)☆115Updated 5 months ago
- Benchmarking Chat Assistants on Long-Term Interactive Memory (ICLR 2025)☆133Updated 2 months ago
- (ACL 2025 Main) Code for MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents https://www.arxiv.org/pdf/2503.019…☆132Updated this week
- Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory☆66Updated last month
- A simplified implementation for experimenting with RLVR on GSM8K. This repository provides a starting point for exploring reasoning.☆113Updated 5 months ago