forecastingresearch / forecastbench
A dynamic forecasting benchmark for LLMs
☆26Updated this week
Alternatives and similar repositories for forecastbench
Users who are interested in forecastbench are comparing it to the repositories listed below.
- Governance of the Commons Simulation (GovSim)☆55Updated 5 months ago
- A toolkit for describing model features and intervening on those features to steer behavior.☆193Updated 8 months ago
- ☆72Updated last year
- ☆92Updated 2 months ago
- Vivaria is METR's tool for running evaluations and conducting agent elicitation research.☆99Updated 2 weeks ago
- ☆14Updated 2 months ago
- ☆171Updated 4 months ago
- ☆23Updated 10 months ago
- Forecastbench Datasets, updated nightly☆12Updated this week
- We develop benchmarks and analysis tools to evaluate the causal reasoning abilities of LLMs.☆118Updated last year
- ☆97Updated 2 weeks ago
- Stanford NLP Python library for benchmarking the utility of LLM interpretability methods☆102Updated 3 weeks ago
- ☆99Updated 4 months ago
- NAACL 2024. Code & Dataset for "🌁 Bridging the Novice-Expert Gap via Models of Decision-Making: A Case Study on Remediating Math Mistake…☆41Updated 11 months ago
- ☆137Updated 8 months ago
- A simple evaluation of generative language models and safety classifiers.☆57Updated 11 months ago
- ☆24Updated 8 months ago
- ☆283Updated last year
- Inference-time scaling for LLMs-as-a-judge.☆251Updated this week
- Forecasting with LLMs☆49Updated last year
- Datasets from the paper "Towards Understanding Sycophancy in Language Models"☆82Updated last year
- CiteME is a benchmark designed to test the abilities of language models in finding papers that are cited in scientific texts.☆48Updated 8 months ago
- ☆83Updated last year
- Psych 290Q S23 @ UC Berkeley: Large Language Models and Cognitive Science☆18Updated last year
- Learning to route instances for Human vs AI Feedback (ACL 2025 Main)☆23Updated 2 months ago
- Synthetic data derived by templating, few shot prompting, transformations on public domain corpora, and monte carlo tree search.☆32Updated 4 months ago
- Code for the ICLR 2024 paper "How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions"☆71Updated last year
- Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training".☆109Updated last year
- The Prism Alignment Project☆79Updated last year
- ☆52Updated last year