UKGovernmentBEIS / control-arena

ControlArena is a collection of settings, model organisms, and protocols for running control experiments.

☆115

Alternatives and similar repositories for control-arena

Users who are interested in control-arena are comparing it to the libraries listed below.


METR / task-standard
METR Task Standard
☆163 Updated 8 months ago
safety-research / safety-tooling
Inference API for many LLMs and other useful tools for empirical research
☆77 Updated last week
UKGovernmentBEIS / inspect_evals
Collection of evals for Inspect AI
☆272 Updated this week
TransformerLensOrg / CircuitsVis
Mechanistic Interpretability Visualizations using React
☆296 Updated 10 months ago
LRudL / evalugator
(Model-written) LLM evals library
☆18 Updated 10 months ago
ArthurConmy / Automatic-Circuit-Discovery
☆248 Updated last year
redwoodresearch / alignment_faking_public
☆79 Updated 3 weeks ago
callummcdougall / sae_vis
Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
☆224 Updated 10 months ago
steering-vectors / steering-vectors
Steering vectors for transformer language models in Pytorch / Huggingface
☆127 Updated 8 months ago
TransluceAI / observatory
A toolkit for describing model features and intervening on those features to steer behavior.
☆209 Updated 11 months ago
saprmarks / feature-circuits
☆192 Updated 2 weeks ago
ai-safety-foundation / sparse_autoencoder
Sparse Autoencoder for Mechanistic Interpretability
☆278 Updated last year
TransluceAI / docent
☆57 Updated last month
anthropics / sleeper-agents-paper
Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training".
☆120 Updated last year
EleutherAI / delphi
Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models …
☆221 Updated this week
google-deepmind / dangerous-capability-evaluations
☆61 Updated last month
timaeus-research / devinterp
Tools for studying developmental interpretability in neural networks.
☆112 Updated 4 months ago
Butanium / nnterp
Unified access to Large Language Model modules using NNsight
☆52 Updated last week
callummcdougall / ARENA_2.0
Resources for skilling up in AI alignment research engineering. Covers basics of deep learning, mechanistic interpretability, and RL.
☆230 Updated 2 months ago
GraySwanAI / circuit-breakers
Improving Alignment and Robustness with Circuit Breakers
☆238 Updated last year
jacobdunefsky / transcoder_circuits
☆183 Updated 11 months ago
emergent-misalignment / emergent-misalignment
☆222 Updated 7 months ago
EleutherAI / elk
Keeping language models honest by directly eliciting knowledge encoded in their activations.
☆211 Updated this week
apartresearch / interpretability-starter
🧠 Starter templates for doing interpretability research
☆74 Updated 2 years ago
METR / public-tasks
☆104 Updated 2 weeks ago
METR / vivaria
Vivaria is METR's tool for running evaluations and conducting agent elicitation research.
☆118 Updated last week
nrimsky / CAA
Steering Llama 2 with Contrastive Activation Addition
☆191 Updated last year
ndif-team / nnsight
The nnsight package enables interpreting and manipulating the internals of deep learned models.
☆692 Updated this week
safety-research / safety-examples
☆19 Updated last week
amack315 / unsupervised-steering-vectors
☆36 Updated last year