ControlArena is a collection of settings, model organisms, and protocols for running control experiments.
☆158 Feb 27, 2026 Updated last week
Alternatives and similar repositories for control-arena
Users who are interested in control-arena compare it to the libraries listed below.
- Inference API for many LLMs and other useful tools for empirical research ☆107 Feb 27, 2026 Updated last week
- ☆35 May 9, 2025 Updated 9 months ago
- ☆20 May 25, 2024 Updated last year
- ☆12 Jul 12, 2024 Updated last year
- ☆21 Jun 22, 2025 Updated 8 months ago
- ☆25 Nov 11, 2025 Updated 3 months ago
- Vivaria is METR's tool for running evaluations and conducting agent elicitation research. ☆134 Feb 15, 2026 Updated 2 weeks ago
- ☆33 Jul 9, 2025 Updated 7 months ago
- ☆18 Feb 25, 2026 Updated last week
- Code for experiments on self-prediction as a way to measure introspection in LLMs ☆16 Dec 10, 2024 Updated last year
- Official Code for What Makes and Breaks Safety Fine-tuning? A Mechanistic Study (NeurIPS 2024) ☆12 Oct 31, 2024 Updated last year
- A library for efficient patching and automatic circuit discovery. ☆90 Dec 31, 2025 Updated 2 months ago
- Tools for studying developmental interpretability in neural networks. ☆127 Dec 30, 2025 Updated 2 months ago
- Code repo for the model organisms and convergent directions of EM papers. ☆53 Sep 22, 2025 Updated 5 months ago
- ☆33 Jun 4, 2025 Updated 9 months ago
- The nnsight package enables interpreting and manipulating the internals of deep learned models. ☆836 Updated this week
- ☆35 Feb 20, 2025 Updated last year
- A tiny easily hackable implementation of a feature dashboard. ☆15 Oct 21, 2025 Updated 4 months ago
- ☆20 Nov 15, 2024 Updated last year
- Code for Tangent Model Composition for Ensembling and Continual Fine-tuning (ICCV 2023) and Tangent Transformers for Composition, Privacy… ☆13 May 14, 2024 Updated last year
- ☆28 Sep 23, 2025 Updated 5 months ago
- Sparse Autoencoder Training Library ☆55 May 1, 2025 Updated 10 months ago
- Reimplementation of https://github.com/montemac/algebraic_value_editing in pure PyTorch for efficiency on large models ☆11 Jun 28, 2023 Updated 2 years ago
- General-Sum variant of the game Diplomacy for evaluating AIs. ☆34 Apr 2, 2024 Updated last year
- Stochastic Parameter Decomposition ☆66 Updated this week
- Decoder only transformer, built from scratch with PyTorch ☆33 Oct 22, 2023 Updated 2 years ago
- A toolbox with the goal of speeding up research on bargaining in MARL (cooperation problems in MARL). ☆32 Sep 29, 2022 Updated 3 years ago
- ☆330 Jul 2, 2024 Updated last year
- Measuring the situational awareness of language models ☆40 Feb 12, 2024 Updated 2 years ago
- Repository with sample code using Apollo's suggested engineering practices ☆15 Dec 16, 2024 Updated last year
- Experimental LLM interface exploring new ways to use AI to improve human thinking ☆19 Feb 27, 2026 Updated last week
- Training Sparse Autoencoders on Language Models ☆1,233 Feb 27, 2026 Updated last week
- ☆273 Oct 1, 2024 Updated last year
- Code for my NeurIPS 2024 ATTRIB paper titled "Attribution Patching Outperforms Automated Circuit Discovery" ☆47 May 31, 2024 Updated last year
- ☆18 Feb 4, 2025 Updated last year
- ☆87 Oct 8, 2025 Updated 4 months ago
- Official codebase for "Analyzing the Generalization and Reliability of Steering Vectors" ☆19 Dec 14, 2024 Updated last year
- Minimum Description Length probing for neural network representations ☆20 Jan 28, 2025 Updated last year
- A library for mechanistic interpretability of GPT-style language models ☆3,133 Updated this week