forecastingresearch / forecastbench
A dynamic forecasting benchmark for LLMs
☆27 Updated this week
Alternatives and similar repositories for forecastbench
Users who are interested in forecastbench compare it to the repositories listed below.
- Forecastbench Datasets, updated nightly ☆12 Updated this week
- Governance of the Commons Simulation (GovSim) ☆56 Updated 6 months ago
- Forecasting with LLMs ☆49 Updated last year
- ☆182 Updated 5 months ago
- ☆43 Updated 9 months ago
- Vivaria is METR's tool for running evaluations and conducting agent elicitation research. ☆103 Updated last week
- ☆104 Updated 2 months ago
- Code for "Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs" ☆54 Updated 5 months ago
- A toolkit for describing model features and intervening on those features to steer behavior. ☆195 Updated 9 months ago
- We develop benchmarks and analysis tools to evaluate the causal reasoning abilities of LLMs. ☆119 Updated last year
- ☆23 Updated 11 months ago
- ☆95 Updated 3 months ago
- Inference-time scaling for LLMs-as-a-judge. ☆272 Updated 3 weeks ago
- Open source interpretability artefacts for R1. ☆157 Updated 3 months ago
- ☆72 Updated last year
- ☆136 Updated 4 months ago
- Interaction-first method for generating demonstrations for web-agents on any website ☆43 Updated 3 months ago
- Plurals: A System for Guiding LLMs Via Simulated Social Ensembles ☆24 Updated last month
- CiteME is a benchmark designed to test the abilities of language models in finding papers that are cited in scientific texts. ☆48 Updated 9 months ago
- ☆289 Updated last year
- ☆421 Updated 8 months ago
- large population models ☆390 Updated 2 weeks ago
- Keeping language models honest by directly eliciting knowledge encoded in their activations. ☆209 Updated last week
- Steer LLM outputs towards a certain topic/subject and enhance response capabilities using activation engineering by adding steering vecto… ☆242 Updated 5 months ago
- ⚓️ Repository for the "Thought Anchors: Which LLM Reasoning Steps Matter?" paper. ☆62 Updated this week
- Functional Benchmarks and the Reasoning Gap ☆88 Updated 10 months ago
- Data exports from select "open data" Polis conversations ☆39 Updated 10 months ago
- Synthetic data derived by templating, few shot prompting, transformations on public domain corpora, and monte carlo tree search. ☆32 Updated 5 months ago
- LLM Attributor: Attribute LLM's Generated Text to Training Data ☆53 Updated last year
- ☆300 Updated last year