adamjermyn / toy_model_interpretability
☆11 · Updated last year
Related projects
Alternatives and complementary repositories for toy_model_interpretability
- Mechanistic Interpretability for Transformer Models ☆49 · Updated 2 years ago
- This repo is built to facilitate the training and analysis of autoregressive transformers on maze-solving tasks. ☆24 · Updated 2 months ago
- Redwood Research's transformer interpretability tools ☆12 · Updated 2 years ago
- Sparse Autoencoder Training Library ☆26 · Updated last week
- Interpreting how transformers simulate agents performing RL tasks ☆69 · Updated last year
- General-Sum variant of the game Diplomacy for evaluating AIs. ☆23 · Updated 7 months ago
- ☆24 · Updated last year
- Scaling scaling laws with board games. ☆40 · Updated last year
- ☆18 · Updated last month
- PyTorch and NNsight implementation of AtP* (Kramar et al 2024, DeepMind) ☆13 · Updated 6 months ago
- Code for reproducing the results from the paper Avoiding Side Effects in Complex Environments ☆12 · Updated 3 years ago
- ☆12 · Updated 3 years ago
- ☆44 · Updated last month
- Get language models to generate responses in a specific format reliably. Open source implementation of Synchromesh: Reliable code generat… ☆24 · Updated 8 months ago
- Code for "Evidence of Learned Look-Ahead in a Chess-Playing Neural Network" ☆14 · Updated 5 months ago
- A formalisation of Cartesian Frames, a perspective on embedded agency, in the HOL theorem prover. ☆19 · Updated 2 years ago
- we got you bro ☆32 · Updated 3 months ago
- Code associated to papers on superposition (in ML interpretability) ☆24 · Updated 2 years ago
- unofficial re-implementation of "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets" ☆61 · Updated 2 years ago
- A library for bridging Python and HTML/Javascript (via Svelte) for creating interactive visualizations ☆13 · Updated 6 months ago
- Tools for studying developmental interpretability in neural networks. ☆74 · Updated this week
- ☆102 · Updated last month
- Notebooks accompanying Anthropic's "Toy Models of Superposition" paper ☆95 · Updated 2 years ago
- Measuring the situational awareness of language models ☆33 · Updated 8 months ago
- A neurosymbolic T5 agent for playing text games, from the EACL 2023 paper "Behavior Cloned Transformers are Neurosymbolic Reasoners" ☆19 · Updated last year
- Sparse and discrete interpretability tool for neural networks ☆53 · Updated 8 months ago
- Code for our paper "Decomposing The Dark Matter of Sparse Autoencoders" ☆13 · Updated 2 weeks ago
- Implementation of OpenAI's 'Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets' paper. ☆34 · Updated last year
- A library for efficient patching and automatic circuit discovery. ☆30 · Updated last month
- Language-annotated Abstraction and Reasoning Corpus ☆78 · Updated last year