☆36 · Apr 30, 2024 · Updated last year
Alternatives and similar repositories for unsupervised-steering-vectors
Users that are interested in unsupervised-steering-vectors are comparing it to the libraries listed below
- Code to reproduce key results accompanying "SAEs (usually) Transfer Between Base and Chat Models" ☆13 · Jul 18, 2024 · Updated last year
- ☆153 · Dec 30, 2025 · Updated 2 months ago
- Repository with sample code using Apollo's suggested engineering practices ☆15 · Dec 16, 2024 · Updated last year
- Official implementation repository for the paper Towards General Conceptual Model Editing via Adversarial Representation Engineering. ☆19 · Dec 6, 2024 · Updated last year
- ☆89 · Dec 18, 2025 · Updated 2 months ago
- Official codebase for "Analyzing the Generalization and Reliability of Steering Vectors" ☆19 · Dec 14, 2024 · Updated last year
- Decoder only transformer, built from scratch with PyTorch ☆33 · Oct 22, 2023 · Updated 2 years ago
- ☆27 · Nov 28, 2024 · Updated last year
- Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research). ☆247 · Feb 27, 2026 · Updated last week
- Sparse Autoencoder Training Library ☆55 · May 1, 2025 · Updated 10 months ago
- Situational Awareness Dataset ☆46 · Dec 14, 2024 · Updated last year
- Code for our paper "Decomposing The Dark Matter of Sparse Autoencoders" ☆23 · Feb 6, 2025 · Updated last year
- ☆25 · Sep 5, 2024 · Updated last year
- A blog on AI, personal development, and living a good life. ☆35 · Updated this week
- The AI that helps you achieve your goals ☆11 · Feb 4, 2024 · Updated 2 years ago
- Steering Llama 2 with Contrastive Activation Addition ☆213 · May 23, 2024 · Updated last year
- ☆209 · Oct 14, 2025 · Updated 4 months ago
- Fluent dreaming for language models ☆13 · Jul 22, 2024 · Updated last year
- ☆12 · Jul 12, 2024 · Updated last year
- Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models … ☆244 · Updated this week
- ☆20 · Nov 15, 2024 · Updated last year
- Data and code for the paper: Finding Safety Neurons in Large Language Models ☆22 · Jan 29, 2026 · Updated last month
- ☆17 · Jul 9, 2025 · Updated 7 months ago
- A quick way to get started with Transformer Lens ☆14 · Dec 13, 2023 · Updated 2 years ago
- Sparse Autoencoder for Mechanistic Interpretability ☆292 · Jul 20, 2024 · Updated last year
- Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction". ☆355 · Jun 13, 2025 · Updated 8 months ago
- Open source replication of Anthropic's Crosscoders for Model Diffing ☆64 · Oct 27, 2024 · Updated last year
- A TinyStories LM with SAEs and transcoders ☆14 · Apr 3, 2025 · Updated 11 months ago
- Sparse probing paper full code. ☆67 · Dec 17, 2023 · Updated 2 years ago
- Measuring and Controlling Persona Drift in Language Model Dialogs ☆21 · Feb 26, 2024 · Updated 2 years ago
- Official code for our paper "Language Models Learn to Mislead Humans via RLHF" ☆19 · Oct 11, 2024 · Updated last year
- The nnsight package enables interpreting and manipulating the internals of deep learned models. ☆836 · Updated this week
- Implementation of Influence Function approximations for differently sized ML models, using PyTorch ☆16 · Sep 15, 2023 · Updated 2 years ago
- Unofficial Implementation of Selective Attention Transformer ☆21 · Oct 31, 2024 · Updated last year
- Code for reproducing our paper "Not All Language Model Features Are Linear" ☆84 · Nov 27, 2024 · Updated last year
- TuneTables is a tabular classifier that implements prompt tuning for frozen prior-fitted networks. ☆23 · Mar 31, 2025 · Updated 11 months ago
- (Model-written) LLM evals library ☆18 · Dec 13, 2024 · Updated last year
- ☆25 · Nov 11, 2025 · Updated 3 months ago
- ☆20 · Feb 17, 2023 · Updated 3 years ago