anthropics / toy-models-of-superposition

Notebooks accompanying Anthropic's "Toy Models of Superposition" paper

☆127

Alternatives and similar repositories for toy-models-of-superposition

Users interested in toy-models-of-superposition are comparing it to the repositories listed below.

neelnanda-io / 1L-Sparse-Autoencoder
☆123 Updated last year
EleutherAI / delphi
Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models …
☆200 Updated this week
callummcdougall / sae_vis
Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
☆207 Updated 7 months ago
likenneth / othello_world
Emergent world representations: Exploring a sequence model trained on a synthetic task
☆184 Updated 2 years ago
ArthurConmy / Automatic-Circuit-Discovery
☆233 Updated 10 months ago
ApolloResearch / e2e_sae
Sparse Autoencoder Training Library
☆54 Updated 3 months ago
TransformerLensOrg / CircuitsVis
Mechanistic Interpretability Visualizations using React
☆272 Updated 7 months ago
google-deepmind / mishax
☆134 Updated 4 months ago
ckkissane / crosscoder-model-diff-replication
Open source replication of Anthropic's Crosscoders for Model Diffing
☆57 Updated 9 months ago
mechanistic-interpretability-grokking / progress-measures-paper
☆68 Updated 2 years ago
bilal-chughtai / rep-theory-mech-interp
☆26 Updated 2 years ago
EleutherAI / elk
Keeping language models honest by directly eliciting knowledge encoded in their activations.
☆207 Updated last week
redwoodresearch / Easy-Transformer
☆121 Updated 11 months ago
EleutherAI / concept-erasure
Erasing concepts from neural representations with provable guarantees
☆231 Updated 6 months ago
KihoPark / linear_rep_geometry
☆103 Updated 5 months ago
taufeeque9 / codebook-features
Sparse and discrete interpretability tool for neural networks
☆63 Updated last year
jbloomAus / SAEDashboard
☆62 Updated this week
ApolloResearch / apd
Attribution-based Parameter Decomposition
☆28 Updated last month
timaeus-research / devinterp
Tools for studying developmental interpretability in neural networks.
☆100 Updated last month
apartresearch / interpretability-starter
🧠 Starter templates for doing interpretability research
☆73 Updated 2 years ago
callummcdougall / sae_visualizer
☆28 Updated last year
wesg52 / universal-neurons
Universal Neurons in GPT2 Language Models
☆30 Updated last year
JoshEngels / MultiDimensionalFeatures
Code for reproducing our paper "Not All Language Model Features Are Linear"
☆77 Updated 8 months ago
UFO-101 / auto-circuit
A library for efficient patching and automatic circuit discovery.
☆73 Updated last week
ai-safety-foundation / sparse_autoencoder
Sparse Autoencoder for Mechanistic Interpretability
☆257 Updated last year
TomFrederik / unseal
Mechanistic Interpretability for Transformer Models
☆51 Updated 3 years ago
steering-vectors / steering-vectors
Steering vectors for transformer language models in PyTorch / Hugging Face
☆119 Updated 5 months ago
justinchiu / openlogprobs
Extract full next-token probabilities via language model APIs
☆247 Updated last year
KihoPark / LLM_Categorical_Hierarchical_Representations
☆104 Updated 5 months ago
ARBORproject / arborproject.github.io
☆81 Updated 5 months ago