EleutherAI / sae-auto-interp

☆135

Alternatives and similar repositories for sae-auto-interp:

Users who are interested in sae-auto-interp are comparing it to the libraries listed below.

steering-vectors / steering-vectors
Steering vectors for transformer language models in Pytorch / Huggingface
☆78 Updated last month
saprmarks / feature-circuits
☆131 Updated 3 months ago
jacobdunefsky / transcoder_circuits
☆53 Updated 2 months ago
callummcdougall / sae_vis
Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
☆176 Updated last month
nrimsky / CAA
Steering Llama 2 with Contrastive Activation Addition
☆113 Updated 7 months ago
HoagyC / sparse_coding
Using sparse coding to find distributed representations used by neural networks.
☆207 Updated last year
OpenMOSS / Language-Model-SAEs
For OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research.
☆82 Updated this week
adamkarvonen / SAEBench
☆41 Updated this week
JoshEngels / MultiDimensionalFeatures
Code for reproducing our paper "Not All Language Model Features Are Linear"
☆66 Updated last month
neelnanda-io / 1L-Sparse-Autoencoder
☆114 Updated last year
UFO-101 / auto-circuit
A library for efficient patching and automatic circuit discovery.
☆46 Updated last month
ArthurConmy / Automatic-Circuit-Discovery
☆201 Updated 3 months ago
saprmarks / geometry-of-truth
☆75 Updated 5 months ago
redwoodresearch / Easy-Transformer
☆106 Updated 5 months ago
ckkissane / crosscoder-model-diff-replication
Open source replication of Anthropic's Crosscoders for Model Diffing
☆28 Updated 2 months ago
andyrdt / refusal_direction
Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".
☆153 Updated 3 months ago
saprmarks / dictionary_learning
☆179 Updated this week
Dakingrai / awesome-mechanistic-interpretability-lm-papers
☆104 Updated last month
EleutherAI / sae
Sparse autoencoders
☆407 Updated this week
ai-safety-foundation / sparse_autoencoder
Sparse Autoencoder for Mechanistic Interpretability
☆209 Updated 5 months ago
neelnanda-io / Crosscoders
☆21 Updated last month
ericwtodd / function_vectors
Function Vectors in Large Language Models (ICLR 2024)
☆131 Updated 3 months ago
TransluceAI / observatory
A toolkit for describing model features and intervening on those features to steer behavior.
☆149 Updated 2 months ago
TransformerLensOrg / CircuitsVis
Mechanistic Interpretability Visualizations using React
☆219 Updated 3 weeks ago
google-deepmind / mishax
☆115 Updated this week
nrimsky / LM-exp
LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces
☆85 Updated last year
GraySwanAI / circuit-breakers
Improving Alignment and Robustness with Circuit Breakers
☆174 Updated 3 months ago
Aaquib111 / edge-attribution-patching
Code for my NeurIPS 2024 ATTRIB paper titled "Attribution Patching Outperforms Automated Circuit Discovery"
☆27 Updated 7 months ago
KihoPark / linear_rep_geometry
☆82 Updated 11 months ago
KoyenaPal / future-lens
Code and Data Repo for the CoNLL Paper -- Future Lens: Anticipating Subsequent Tokens from a Single Hidden State
☆18 Updated last year