ai-safety-foundation / sparse_autoencoderLinks

Sparse Autoencoder for Mechanistic Interpretability

☆257

Alternatives and similar repositories for sparse_autoencoder

Users that are interested in sparse_autoencoder are comparing it to the libraries listed below

Sorting:

HoagyC / sparse_coding
Using sparse coding to find distributed representations used by neural networks.
☆261Updated last year
saprmarks / dictionary_learning
☆324Updated 3 weeks ago
ArthurConmy / Automatic-Circuit-Discovery
☆234Updated 10 months ago
callummcdougall / sae_vis
Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
☆207Updated 7 months ago
EleutherAI / sparsify
Sparsify transformers with SAEs and transcoders
☆598Updated this week
adamkarvonen / SAEBench
☆109Updated 3 weeks ago
TransformerLensOrg / CircuitsVis
Mechanistic Interpretability Visualizations using React
☆272Updated 7 months ago
neelnanda-io / 1L-Sparse-Autoencoder
☆124Updated last year
EleutherAI / delphi
Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models …
☆202Updated this week
jbloomAus / SAELens
Training Sparse Autoencoders on Language Models
☆900Updated this week
ndif-team / nnsight
The nnsight package enables interpreting and manipulating the internals of deep learned models.
☆622Updated last week
saprmarks / feature-circuits
☆183Updated 3 weeks ago
jacobdunefsky / transcoder_circuits
☆157Updated 8 months ago
openai / sparse_autoencoder
☆505Updated last year
steering-vectors / steering-vectors
Steering vectors for transformer language models in Pytorch / Huggingface
☆120Updated 5 months ago
neelnanda-io / Crosscoders
☆51Updated 8 months ago
nrimsky / CAA
Steering Llama 2 with Contrastive Activation Addition
☆167Updated last year
ApolloResearch / e2e_sae
Sparse Autoencoder Training Library
☆54Updated 3 months ago
redwoodresearch / Easy-Transformer
☆121Updated last year
AlignmentResearch / tuned-lens
Tools for understanding how transformer predictions are built layer-by-layer
☆512Updated last year
ruizheliUOA / Awesome-Interpretability-in-Large-Language-Models
This repository collects all relevant resources about interpretability in LLMs
☆366Updated 9 months ago
stanfordnlp / pyvene
Stanford NLP Python library for understanding and improving PyTorch models via interventions
☆786Updated this week
ckkissane / crosscoder-model-diff-replication
Open source replication of Anthropic's Crosscoders for Model Diffing
☆57Updated 9 months ago
ARBORproject / arborproject.github.io
☆81Updated 5 months ago
science-of-finetuning / crosscoder_learning
Modified to support crosscoder training.
☆21Updated last week
Prisma-Multimodal / ViT-Prisma
ViT Prisma is a mechanistic interpretability library for Vision and Video Transformers (ViTs).
☆292Updated 2 weeks ago
redwoodresearch / alignment_faking_public
☆70Updated 2 months ago
wesg52 / sparse-probing-paper
Sparse probing paper full code.
☆58Updated last year
KihoPark / linear_rep_geometry
☆103Updated 5 months ago
OpenMOSS / Language-Model-SAEs
For OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research.
☆141Updated this week