EleutherAI / sae

Sparse autoencoders

☆407

Alternatives and similar repositories for sae:

Users who are interested in sae are comparing it to the libraries listed below.

jbloomAus / SAELens
Training Sparse Autoencoders on Language Models
☆573 Updated this week
callummcdougall / sae_vis
Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
☆176 Updated last month
ai-safety-foundation / sparse_autoencoder
Sparse Autoencoder for Mechanistic Interpretability
☆209 Updated 5 months ago
saprmarks / dictionary_learning
☆179 Updated this week
HoagyC / sparse_coding
Using sparse coding to find distributed representations used by neural networks.
☆207 Updated last year
EleutherAI / sae-auto-interp
☆135 Updated this week
TransformerLensOrg / CircuitsVis
Mechanistic Interpretability Visualizations using React
☆219 Updated 3 weeks ago
openai / sparse_autoencoder
☆404 Updated 5 months ago
ArthurConmy / Automatic-Circuit-Discovery
☆201 Updated 3 months ago
neelnanda-io / 1L-Sparse-Autoencoder
☆114 Updated last year
ndif-team / nnsight
The nnsight package enables interpreting and manipulating the internals of deep learned models.
☆458 Updated this week
steering-vectors / steering-vectors
Steering vectors for transformer language models in Pytorch / Huggingface
☆78 Updated last month
saprmarks / feature-circuits
☆131 Updated 3 months ago
andyrdt / refusal_direction
Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".
☆153 Updated 3 months ago
justinchiu / openlogprobs
Extract full next-token probabilities via language model APIs
☆230 Updated 10 months ago
nrimsky / CAA
Steering Llama 2 with Contrastive Activation Addition
☆113 Updated 7 months ago
ruizheliUOA / Awesome-Interpretability-in-Large-Language-Models
This repository collects all relevant resources about interpretability in LLMs
☆305 Updated 2 months ago
OpenMOSS / Language-Model-SAEs
For OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research.
☆82 Updated this week
google-deepmind / mishax
☆115 Updated this week
TransluceAI / observatory
A toolkit for describing model features and intervening on those features to steer behavior.
☆149 Updated 2 months ago
AlignmentResearch / tuned-lens
Tools for understanding how transformer predictions are built layer-by-layer
☆459 Updated 7 months ago
redwoodresearch / Easy-Transformer
☆106 Updated 5 months ago
GraySwanAI / circuit-breakers
Improving Alignment and Robustness with Circuit Breakers
☆174 Updated 3 months ago
mcleish7 / arithmetic
Code to reproduce "Transformers Can Do Arithmetic with the Right Embeddings", McLeish et al. (NeurIPS 2024)
☆182 Updated 7 months ago
EleutherAI / concept-erasure
Erasing concepts from neural representations with provable guarantees
☆219 Updated last month
davidbau / baukit
☆184 Updated 10 months ago
HazyResearch / zoology
Understand and test language model architectures on synthetic tasks.
☆175 Updated this week
collin-burns / discovering_latent_knowledge
☆258 Updated 10 months ago
jacobdunefsky / transcoder_circuits
☆53 Updated 2 months ago