siegelz / core-bench
☆36 · Updated 2 months ago
Alternatives and similar repositories for core-bench
Users interested in core-bench are comparing it to the libraries listed below.
- ☆74 · Updated this week
- [ICLR'25] ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery ☆85 · Updated 2 weeks ago
- Complex Function Calling Benchmark. ☆99 · Updated 3 months ago
- Codebase accompanying the Summary of a Haystack paper. ☆78 · Updated 7 months ago
- [EMNLP 2024] A Retrieval Benchmark for Scientific Literature Search ☆84 · Updated 5 months ago
- Large language models (LLMs) made easy: EasyLM is a one-stop solution for pre-training, finetuning, evaluating and serving LLMs in JAX/Flax. ☆72 · Updated 8 months ago
- SWE Arena ☆33 · Updated last month
- Doing simple retrieval from LLMs at various context lengths to measure accuracy ☆99 · Updated last year
- Source code for the collaborative reasoner research project at Meta FAIR. ☆40 · Updated 3 weeks ago
- Middleware for LLMs: Tools Are Instrumental for Language Agents in Complex Environments (EMNLP 2024) ☆36 · Updated 4 months ago
- Lean implementation of various multi-agent LLM methods, including Iteration of Thought (IoT) ☆110 · Updated 3 months ago
- Official Implementation of "Reasoning Language Models: A Blueprint" ☆59 · Updated 3 months ago
- Open source interpretability artefacts for R1. ☆109 · Updated 3 weeks ago
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file. ☆172 · Updated 2 months ago
- Functional Benchmarks and the Reasoning Gap ☆86 · Updated 7 months ago
- A small library of LLM judges ☆191 · Updated 2 weeks ago
- Dataset and evaluation suite enabling LLM instruction-following for scientific literature understanding. ☆40 · Updated last month
- The code for the paper ROUTERBENCH: A Benchmark for Multi-LLM Routing System ☆118 · Updated 11 months ago
- Scalable Meta-Evaluation of LLMs as Evaluators ☆42 · Updated last year
- A simple unified framework for evaluating LLMs ☆209 · Updated last month
- This repository contains the ScholarQABench data and evaluation pipeline. ☆71 · Updated last month
- Train your own SOTA deductive reasoning model ☆92 · Updated 2 months ago
- Verdict is a library for scaling judge-time compute. ☆209 · Updated 2 weeks ago
- Code repo for "Agent Instructs Large Language Models to be General Zero-Shot Reasoners" ☆108 · Updated 8 months ago
- ☆40 · Updated 9 months ago
- A benchmark that challenges language models to code solutions for scientific problems ☆119 · Updated this week
- ☆74 · Updated 2 weeks ago
- ☆43 · Updated 9 months ago
- A Collection of Competitive Text-Based Games for Language Model Evaluation and Reinforcement Learning ☆153 · Updated this week
- [ACL 2024] <Large Language Models for Automated Open-domain Scientific Hypotheses Discovery>. It has also received the best poster award … ☆40 · Updated 6 months ago