siegelz / core-bench
☆30Updated last month
Alternatives and similar repositories for core-bench:
Users who are interested in core-bench are comparing it to the repositories listed below
- ☆60Updated this week
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file.☆168Updated last month
- [ICLR'25] ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery☆80Updated last week
- A virtual environment for developing and evaluating automated scientific discovery agents.☆143Updated last month
- Large language models (LLMs) made easy: EasyLM is a one-stop solution for pre-training, finetuning, evaluating and serving LLMs in JAX/Fl…☆72Updated 8 months ago
- [EMNLP 2024] A Retrieval Benchmark for Scientific Literature Search☆83Updated 4 months ago
- Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers"☆103Updated last year
- ☆72Updated 2 months ago
- A Collection of Competitive Text-Based Games for Language Model Evaluation and Reinforcement Learning☆129Updated last week
- A benchmark that challenges language models to code solutions for scientific problems☆114Updated this week
- SWE Arena☆31Updated last week
- [ACL'24] Code and data for the paper "When is Tree Search Useful for LLM Planning? It Depends on the Discriminator"☆54Updated last year
- Middleware for LLMs: Tools Are Instrumental for Language Agents in Complex Environments (EMNLP'2024)☆36Updated 3 months ago
- [arXiv] EvalTree: Profiling Language Model Weaknesses via Hierarchical Capability Trees☆15Updated last month
- This repository contains ScholarQABench data and evaluation pipeline.☆71Updated last week
- Reproducible, flexible LLM evaluations☆191Updated 3 weeks ago
- ☆57Updated last month
- Datasets from the paper "Towards Understanding Sycophancy in Language Models"☆74Updated last year
- Official repository for "Scaling Retrieval-Based Language Models with a Trillion-Token Datastore".☆196Updated last week
- Scalable Meta-Evaluation of LLMs as Evaluators☆42Updated last year
- Discovering Data-driven Hypotheses in the Wild☆74Updated 5 months ago
- ☆114Updated 2 months ago
- A simple unified framework for evaluating LLMs☆209Updated last week
- PyTorch library for Active Fine-Tuning☆63Updated 2 months ago
- ☆41Updated 2 weeks ago
- ☆70Updated 5 months ago
- Replicating O1 inference-time scaling laws☆83Updated 4 months ago
- Code for NeurIPS'24 paper 'Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization'☆187Updated 4 months ago
- Dataset and evaluation suite enabling LLM instruction-following for scientific literature understanding.☆39Updated last month
- Benchmarking Chat Assistants on Long-Term Interactive Memory (ICLR 2025)☆65Updated 2 months ago