forecastingresearch / forecastbench

A dynamic forecasting benchmark for LLMs

☆46

Alternatives and similar repositories for forecastbench

Users who are interested in forecastbench are comparing it to the repositories listed below.

forecastingresearch / forecastbench-datasets
Forecastbench Datasets, updated nightly
☆20 Updated this week
AgentTorch / AgentTorch
large population models
☆460 Updated this week
giorgiopiatti / GovSim
Governance of the Commons Simulation (GovSim)
☆61 Updated 10 months ago
TransluceAI / observatory
A toolkit for describing model features and intervening on those features to steer behavior.
☆214 Updated last year
nikitadhawan / natural
☆43 Updated last year
UKGovernmentBEIS / hibayes
☆36 Updated last month
emergent-misalignment / emergent-misalignment
☆228 Updated last month
goodfire-ai / r1-interpretability
Open source interpretability artefacts for R1.
☆163 Updated 7 months ago
anthropics / sleeper-agents-paper
Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training".
☆122 Updated last year
haizelabs / verdict
Inference-time scaling for LLMs-as-a-judge.
☆312 Updated 3 weeks ago
SakanaAI / AI-Scientist-ICLR2025-Workshop-Experiment
☆273 Updated 7 months ago
vinid / NegotiationArena
☆79 Updated last year
marketagents-ai / MarketAgents
An agent orchestration framework for economic agents
☆108 Updated 3 months ago
bethgelab / CiteME
CiteME is a benchmark designed to test the abilities of language models in finding papers that are cited in scientific texts.
☆48 Updated 3 weeks ago
dannyallover / llm_forecasting
Forecasting with LLMs
☆55 Updated last year
GoodAI / goodai-ltm-benchmark
A library for benchmarking the Long Term Memory and Continual learning capabilities of LLM based agents. With all the tests and code you…
☆80 Updated 11 months ago
haizelabs / bijection-learning
☆26 Updated last year
aounon / llm-rank-optimizer
☆114 Updated 3 months ago
rosewang2008 / bridge
NAACL 2024. Code & Dataset for "🌁 Bridging the Novice-Expert Gap via Models of Decision-Making: A Case Study on Remediating Math Mistake…
☆45 Updated last year
interp-reasoning / thought-anchors
⚓️ Repository for the "Thought Anchors: Which LLM Reasoning Steps Matter?" paper.
☆92 Updated last month
KihoPark / LLM_Categorical_Hierarchical_Representations
☆111 Updated 9 months ago
centerforaisafety / emergent-values
Code for "Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs"
☆83 Updated 9 months ago
steering-vectors / steering-vectors
Steering vectors for transformer language models in Pytorch / Huggingface
☆130 Updated 9 months ago
causalNLP / cladder
We develop benchmarks and analysis tools to evaluate the causal reasoning abilities of LLMs.
☆133 Updated last year
josh-ashkinaze / plurals
Plurals: A System for Guiding LLMs Via Simulated Social Ensembles
☆28 Updated last week
METR / task-standard
METR Task Standard
☆168 Updated 9 months ago
METR / vivaria
Vivaria is METR's tool for running evaluations and conducting agent elicitation research.
☆121 Updated 2 weeks ago
lisadunlap / VibeCheck
Automated Qualitative Analysis of LLMs (ICLR 2025)
☆51 Updated 4 months ago
METR / RE-Bench
☆119 Updated last month
SalesforceAIResearch / CRMArena
Official Repo for CRMArena and CRMArena-Pro
☆126 Updated 3 weeks ago