amack315 / unsupervised-steering-vectors
☆36Apr 30, 2024Updated last year
Alternatives and similar repositories for unsupervised-steering-vectors
Users who are interested in unsupervised-steering-vectors are comparing it to the libraries listed below.
- Code to reproduce key results accompanying "SAEs (usually) Transfer Between Base and Chat Models"☆13Jul 18, 2024Updated last year
- ☆146Dec 30, 2025Updated last month
- Repository with sample code using Apollo's suggested engineering practices☆15Dec 16, 2024Updated last year
- Official implementation repository for the paper Towards General Conceptual Model Editing via Adversarial Representation Engineering.☆18Dec 6, 2024Updated last year
- ☆88Dec 18, 2025Updated last month
- Official codebase for "Analyzing the Generalization and Reliability of Steering Vectors"☆19Dec 14, 2024Updated last year
- ☆25Nov 28, 2024Updated last year
- Decoder only transformer, built from scratch with PyTorch☆32Oct 22, 2023Updated 2 years ago
- Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).☆240Dec 16, 2024Updated last year
- Sparse Autoencoder Training Library☆56May 1, 2025Updated 9 months ago
- Situational Awareness Dataset☆43Dec 14, 2024Updated last year
- Code for our paper "Decomposing The Dark Matter of Sparse Autoencoders"☆24Feb 6, 2025Updated last year
- ☆25Sep 5, 2024Updated last year
- A blog on AI, personal development, and living a good life.☆35Updated this week
- The AI that helps you achieve your goals☆11Feb 4, 2024Updated 2 years ago
- Steering Llama 2 with Contrastive Activation Addition☆209May 23, 2024Updated last year
- ☆207Oct 14, 2025Updated 4 months ago
- ☆12Jul 12, 2024Updated last year
- Fluent dreaming for language models☆13Jul 22, 2024Updated last year
- Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models …☆241Updated this week
- ☆20Nov 15, 2024Updated last year
- A quick way to get started with Transformer Lens☆14Dec 13, 2023Updated 2 years ago
- Data and code for the paper: Finding Safety Neurons in Large Language Models☆20Jan 29, 2026Updated 2 weeks ago
- ☆16Jul 9, 2025Updated 7 months ago
- Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".☆342Jun 13, 2025Updated 8 months ago
- Sparse Autoencoder for Mechanistic Interpretability☆291Jul 20, 2024Updated last year
- Open source replication of Anthropic's Crosscoders for Model Diffing☆64Oct 27, 2024Updated last year
- A TinyStories LM with SAEs and transcoders☆14Apr 3, 2025Updated 10 months ago
- Sparse probing paper full code.☆66Dec 17, 2023Updated 2 years ago
- The nnsight package enables interpreting and manipulating the internals of deep learned models.☆811Updated this week
- Measuring and Controlling Persona Drift in Language Model Dialogs☆21Feb 26, 2024Updated last year
- Official Code for our paper: "Language Models Learn to Mislead Humans via RLHF"☆18Oct 11, 2024Updated last year
- ☆17Updated this week
- Implementation of Influence Function approximations for differently sized ML models, using PyTorch☆16Sep 15, 2023Updated 2 years ago
- Unofficial Implementation of Selective Attention Transformer☆20Oct 31, 2024Updated last year
- TuneTables is a tabular classifier that implements prompt tuning for frozen prior-fitted networks.☆23Mar 31, 2025Updated 10 months ago
- (Model-written) LLM evals library☆18Dec 13, 2024Updated last year
- Code for reproducing our paper "Not All Language Model Features Are Linear"☆83Nov 27, 2024Updated last year
- Applying SAEs for fine-grained control☆25Dec 15, 2024Updated last year