EleutherAI / sparsifyLinks

Sparsify transformers with SAEs and transcoders

☆595

Alternatives and similar repositories for sparsify

Users that are interested in sparsify are comparing it to the libraries listed below

Sorting:

jbloomAus / SAELens
Training Sparse Autoencoders on Language Models
☆895Updated this week
saprmarks / dictionary_learning
☆320Updated 2 weeks ago
ndif-team / nnsight
The nnsight package enables interpreting and manipulating the internals of deep learned models.
☆619Updated this week
callummcdougall / sae_vis
Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
☆207Updated 7 months ago
ai-safety-foundation / sparse_autoencoder
Sparse Autoencoder for Mechanistic Interpretability
☆257Updated last year
openai / sparse_autoencoder
☆503Updated last year
TransformerLensOrg / CircuitsVis
Mechanistic Interpretability Visualizations using React
☆267Updated 7 months ago
HoagyC / sparse_coding
Using sparse coding to find distributed representations used by neural networks.
☆260Updated last year
EleutherAI / delphi
Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models …
☆196Updated last week
ArthurConmy / Automatic-Circuit-Discovery
☆233Updated 10 months ago
AlignmentResearch / tuned-lens
Tools for understanding how transformer predictions are built layer-by-layer
☆512Updated last year
jacobdunefsky / transcoder_circuits
☆152Updated 8 months ago
neelnanda-io / 1L-Sparse-Autoencoder
☆123Updated last year
stanfordnlp / pyvene
Stanford NLP Python library for understanding and improving PyTorch models via interventions
☆779Updated last week
adamkarvonen / SAEBench
☆107Updated 2 weeks ago
saprmarks / feature-circuits
☆183Updated 2 weeks ago
ruizheliUOA / Awesome-Interpretability-in-Large-Language-Models
This repository collects all relevant resources about interpretability in LLMs
☆366Updated 9 months ago
Prisma-Multimodal / ViT-Prisma
ViT Prisma is a mechanistic interpretability library for Vision and Video Transformers (ViTs).
☆289Updated last week
neelnanda-io / Crosscoders
☆50Updated 8 months ago
andyrdt / refusal_direction
Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".
☆246Updated last month
callummcdougall / ARENA_2.0
Resources for skilling up in AI alignment research engineering. Covers basics of deep learning, mechanistic interpretability, and RL.
☆219Updated last year
TransluceAI / observatory
A toolkit for describing model features and intervening on those features to steer behavior.
☆194Updated 8 months ago
ARBORproject / arborproject.github.io
☆81Updated 5 months ago
goodfire-ai / r1-interpretability
Open source interpretability artefacts for R1.
☆157Updated 3 months ago
nrimsky / CAA
Steering Llama 2 with Contrastive Activation Addition
☆165Updated last year
justinchiu / openlogprobs
Extract full next-token probabilities via language model APIs
☆247Updated last year
Dakingrai / awesome-mechanistic-interpretability-lm-papers
☆177Updated 8 months ago
collin-burns / discovering_latent_knowledge
☆274Updated last year
steering-vectors / steering-vectors
Steering vectors for transformer language models in Pytorch / Huggingface
☆118Updated 5 months ago
science-of-finetuning / crosscoder_learning
Modified to support crosscoder training.
☆22Updated this week