jbloomAus / SAELensLinks

Training Sparse Autoencoders on Language Models

☆995

Alternatives and similar repositories for SAELens

Users that are interested in SAELens are comparing it to the libraries listed below

Sorting:

EleutherAI / sparsify
Sparsify transformers with SAEs and transcoders
☆640Updated this week
ndif-team / nnsight
The nnsight package enables interpreting and manipulating the internals of deep learned models.
☆683Updated this week
saprmarks / dictionary_learning
☆348Updated last month
ai-safety-foundation / sparse_autoencoder
Sparse Autoencoder for Mechanistic Interpretability
☆272Updated last year
TransformerLensOrg / CircuitsVis
Mechanistic Interpretability Visualizations using React
☆291Updated 10 months ago
openai / sparse_autoencoder
☆527Updated last year
HoagyC / sparse_coding
Using sparse coding to find distributed representations used by neural networks.
☆274Updated last year
TransformerLensOrg / TransformerLens
A library for mechanistic interpretability of GPT-style language models
☆2,675Updated this week
stanfordnlp / pyvene
Stanford NLP Python library for understanding and improving PyTorch models via interventions
☆819Updated last month
ArthurConmy / Automatic-Circuit-Discovery
☆244Updated last year
callummcdougall / sae_vis
Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
☆221Updated 10 months ago
adamkarvonen / SAEBench
☆131Updated last week
AlignmentResearch / tuned-lens
Tools for understanding how transformer predictions are built layer-by-layer
☆532Updated 2 months ago
jacobdunefsky / transcoder_circuits
☆175Updated 11 months ago
ruizheliUOA / Awesome-Interpretability-in-Large-Language-Models
This repository collects all relevant resources about interpretability in LLMs
☆374Updated 11 months ago
callummcdougall / ARENA_3.0
☆747Updated 2 weeks ago
Dakingrai / awesome-mechanistic-interpretability-lm-papers
☆202Updated 10 months ago
andyrdt / refusal_direction
Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".
☆283Updated 4 months ago
nrimsky / CAA
Steering Llama 2 with Contrastive Activation Addition
☆188Updated last year
EleutherAI / delphi
Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models …
☆216Updated last week
andyzoujm / representation-engineering
Representation Engineering: A Top-Down Approach to AI Transparency
☆897Updated last year
callummcdougall / ARENA_2.0
Resources for skilling up in AI alignment research engineering. Covers basics of deep learning, mechanistic interpretability, and RL.
☆228Updated 2 months ago
neelnanda-io / 1L-Sparse-Autoencoder
☆128Updated last year
saprmarks / feature-circuits
☆190Updated this week
neelnanda-io / Crosscoders
☆54Updated 11 months ago
redwoodresearch / Easy-Transformer
☆126Updated last year
OpenMOSS / Language-Model-SAEs
For OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research.
☆154Updated this week
Prisma-Multimodal / ViT-Prisma
ViT Prisma is a mechanistic interpretability library for Vision and Video Transformers (ViTs).
☆311Updated 2 months ago
alan-cooney / transformer-from-scratch
Decoder only transformer, built from scratch with PyTorch
☆31Updated last year
collin-burns / discovering_latent_knowledge
☆278Updated last year