redwoodresearch / interp

Redwood Research's transformer interpretability tools

☆14

Alternatives and similar repositories for interp

Users who are interested in interp compare it to the libraries listed below.

TomFrederik / unseal
Mechanistic Interpretability for Transformer Models
☆51 Updated 3 years ago
JacobPfau / procgenAISC
☆19 Updated 2 years ago
UFO-101 / auto-circuit
A library for efficient patching and automatic circuit discovery.
☆70 Updated 2 months ago
EleutherAI / elk-generalization
Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e…
☆28 Updated last year
ApolloResearch / e2e_sae
Sparse Autoencoder Training Library
☆53 Updated 2 months ago
google-deepmind / dangerous-capability-evaluations
☆55 Updated 9 months ago
samacqua / LARC
Language-annotated Abstraction and Reasoning Corpus
☆88 Updated 2 years ago
METR / RE-Bench
☆92 Updated 2 months ago
koayon / atp_star
PyTorch and NNsight implementation of AtP* (Kramar et al 2024, DeepMind)
☆18 Updated 5 months ago
noanabeshima / matryoshka-saes
☆18 Updated 7 months ago
Sea-Snell / grokking
unofficial re-implementation of "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets"
☆77 Updated 3 years ago
callummcdougall / sae_visualizer
☆28 Updated last year
neelnanda-io / 1L-Sparse-Autoencoder
☆122 Updated last year
longtermrisk / openweights
A python sdk for LLM finetuning and inference on runpod infrastructure
☆11 Updated last week
redwoodresearch / alignment_faking_public
☆69 Updated last month
aypan17 / machiavelli
☆137 Updated 8 months ago
adamkarvonen / SAE_BoardGameEval
☆23 Updated 5 months ago
Butanium / nnterp
A small package implementing some useful wrapping around nnsight
☆14 Updated this week
Aaquib111 / edge-attribution-patching
Code for my NeurIPS 2024 ATTRIB paper titled "Attribution Patching Outperforms Automated Circuit Discovery"
☆40 Updated last year
nrimsky / InfluenceFunctions
Implementation of Influence Function approximations for differently sized ML models, using PyTorch
☆15 Updated last year
timaeus-research / devinterp
Tools for studying developmental interpretability in neural networks.
☆99 Updated 3 weeks ago
ckkissane / crosscoder-model-diff-replication
Open source replication of Anthropic's Crosscoders for Model Diffing
☆56 Updated 8 months ago
nostalgebraist / transformer-utils
Utilities for the HuggingFace transformers library
☆68 Updated 2 years ago
KoyenaPal / future-lens
Code and Data Repo for the CoNLL Paper -- Future Lens: Anticipating Subsequent Tokens from a Single Hidden State
☆18 Updated last year
victorvikram / ConceptARC
Materials for ConceptARC paper
☆96 Updated 8 months ago
likenneth / othello_world
Emergent world representations: Exploring a sequence model trained on a synthetic task
☆182 Updated 2 years ago
FlyingPumba / InterpBench
A benchmark for mechanistic discovery of circuits in Transformers
☆14 Updated 7 months ago
andyljones / boardlaw
Scaling scaling laws with board games.
☆49 Updated last year
thestephencasper / everything-you-need
we got you bro
☆35 Updated 11 months ago
jbloomAus / DecisionTransformerInterpretability
Interpreting how transformers simulate agents performing RL tasks
☆87 Updated last year