adamjermyn / toy_model_interpretability

☆11

Alternatives and similar repositories for toy_model_interpretability:

Users who are interested in toy_model_interpretability are comparing it to the repositories listed below.

TomFrederik / unseal
Mechanistic Interpretability for Transformer Models
☆49 Updated 2 years ago
noanabeshima / tinymodel
A TinyStories LM with SAEs and transcoders
☆10 Updated 2 weeks ago
understanding-search / maze-transformer
This repo is built to facilitate the training and analysis of autoregressive transformers on maze-solving tasks.
☆25 Updated 4 months ago
redwoodresearch / interp
Redwood Research's transformer interpretability tools
☆13 Updated 2 years ago
neelnanda-io / Neuroscope
Accompanying codebase for neuroscope.io, a website for displaying max activating dataset examples for language model neurons
☆11 Updated last year
neale / avoiding-side-effects
Code for reproducing the results from the paper Avoiding Side Effects in Complex Environments
☆12 Updated 3 years ago
poppingtonic / transformer-visualization
Mechanistic Interpretability Tutorials, Results and research log as I learn from publicly available research, and experimentation.
☆10 Updated last year
JasonGross / guarantees-based-mechanistic-interpretability
☆11 Updated this week
alexander-turner / attainable-utility-preservation
☆12 Updated 3 years ago
bilal-chughtai / rep-theory-mech-interp
☆26 Updated last year
google-deepmind / cartesian-frames
A formalisation of Cartesian Frames, a perspective on embedded agency, in the HOL theorem prover.
☆19 Updated 3 years ago
Sea-Snell / grokking
Unofficial re-implementation of "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets"
☆67 Updated 2 years ago
neelnanda-io / Grokking
A Mechanistic Interpretability Analysis of Grokking
☆19 Updated 2 years ago
HumanCompatibleAI / leela-interp
Code for "Evidence of Learned Look-Ahead in a Chess-Playing Neural Network"
☆17 Updated 7 months ago
cognitiveailab / neurosymbolic
A neurosymbolic T5 agent for playing text games, from the EACL 2023 paper "Behavior Cloned Transformers are Neurosymbolic Reasoners"
☆19 Updated last year
koayon / atp_star
PyTorch and NNsight implementation of AtP* (Kramar et al 2024, DeepMind)
☆18 Updated 9 months ago
hijohnnylin / neuronpedia-scorer
☆15 Updated 11 months ago
jbloomAus / DecisionTransformerInterpretability
Interpreting how transformers simulate agents performing RL tasks
☆77 Updated last year
minqi / wordcraft
An environment for benchmarking commonsense agents
☆28 Updated 4 years ago
EleutherAGI / summarisation
The Intermediate Goal of the project is to train a GPT like architecture to learn to summarise reddit posts from human preferences, as th…
☆12 Updated 3 years ago
wattenberg / superposition
Code associated to papers on superposition (in ML interpretability)
☆26 Updated 2 years ago
noanabeshima / matryoshka-saes
☆11 Updated last month
anthropics / toy-models-of-superposition
Notebooks accompanying Anthropic's "Toy Models of Superposition" paper
☆108 Updated 2 years ago
Mech-Interp / PySvelte
A library for bridging Python and HTML/Javascript (via Svelte) for creating interactive visualizations
☆14 Updated 9 months ago
patrik-ha / explainable-minichess
Chess environment for smaller chess variants, AlphaZero-like MCTS-learning, and Concept Detection
☆15 Updated last year
taufeeque9 / codebook-features
Sparse and discrete interpretability tool for neural networks
☆58 Updated 11 months ago
samacqua / LARC
Language-annotated Abstraction and Reasoning Corpus
☆82 Updated last year
andyljones / boardlaw
Scaling scaling laws with board games.
☆45 Updated last year
apple / ml-np-rasp
☆19 Updated last year
you68681 / GPAR
☆21 Updated 9 months ago