jbloomAus / SAELens
Training Sparse Autoencoders on Language Models
☆802 · Updated last week
Alternatives and similar repositories for SAELens
Users interested in SAELens are comparing it to the libraries listed below.
- Sparsify transformers with SAEs and transcoders ☆553 · Updated this week
- The nnsight package enables interpreting and manipulating the internals of deep learned models. ☆579 · Updated this week
- Sparse Autoencoder for Mechanistic Interpretability ☆248 · Updated 10 months ago
- ☆302 · Updated 2 weeks ago
- ☆480 · Updated 10 months ago
- Mechanistic Interpretability Visualizations using React ☆253 · Updated 5 months ago
- Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research). ☆200 · Updated 5 months ago
- Using sparse coding to find distributed representations used by neural networks. ☆247 · Updated last year
- A library for mechanistic interpretability of GPT-style language models ☆2,217 · Updated this week
- ☆223 · Updated 8 months ago
- This repository collects all relevant resources about interpretability in LLMs ☆353 · Updated 7 months ago
- Stanford NLP Python library for understanding and improving PyTorch models via interventions ☆750 · Updated this week
- ☆124 · Updated 6 months ago
- Tools for understanding how transformer predictions are built layer-by-layer ☆497 · Updated last year
- Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models … ☆181 · Updated this week
- ☆97 · Updated last month
- ☆567 · Updated this week
- ☆171 · Updated last month
- ☆121 · Updated last year
- Representation Engineering: A Top-Down Approach to AI Transparency ☆831 · Updated 9 months ago
- Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction". ☆223 · Updated 8 months ago
- ☆117 · Updated 10 months ago
- Resources for skilling up in AI alignment research engineering. Covers basics of deep learning, mechanistic interpretability, and RL. ☆214 · Updated last year
- For OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research. ☆116 · Updated this week
- Steering Llama 2 with Contrastive Activation Addition ☆155 · Updated last year
- ☆209 · Updated last year
- ☆156 · Updated 6 months ago
- ☆43 · Updated 6 months ago
- Extract full next-token probabilities via language model APIs ☆248 · Updated last year
- Open source replication of Anthropic's Crosscoders for Model Diffing ☆55 · Updated 7 months ago