Representation Engineering: A Top-Down Approach to AI Transparency
☆953 Aug 14, 2024 Updated last year
Alternatives and similar repositories for representation-engineering
Users who are interested in representation-engineering are comparing it to the libraries listed below.
- A library for making RepE control vectors ☆689 Sep 24, 2025 Updated 5 months ago
- Improving Alignment and Robustness with Circuit Breakers ☆258 Sep 24, 2024 Updated last year
- A resource repository for representation engineering in large language models ☆148 Nov 14, 2024 Updated last year
- A library for mechanistic interpretability of GPT-style language models ☆3,112 Updated this week
- Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research). ☆247 Updated this week
- ☆271 Oct 1, 2024 Updated last year
- ☆209 Oct 14, 2025 Updated 4 months ago
- Inference-Time Intervention: Eliciting Truthful Answers from a Language Model ☆571 Jan 28, 2025 Updated last year
- Stanford NLP Python library for understanding and improving PyTorch models via interventions ☆863 Jan 29, 2026 Updated last month
- Steering Llama 2 with Contrastive Activation Addition ☆212 May 23, 2024 Updated last year
- Steering vectors for transformer language models in Pytorch / Huggingface ☆139 Feb 21, 2025 Updated last year
- Training Sparse Autoencoders on Language Models ☆1,219 Updated this week
- ☆250 Feb 22, 2024 Updated 2 years ago
- Using sparse coding to find distributed representations used by neural networks. ☆297 Nov 10, 2023 Updated 2 years ago
- The nnsight package enables interpreting and manipulating the internals of deep learned models. ☆825 Updated this week
- Tools for understanding how transformer predictions are built layer-by-layer ☆567 Aug 7, 2025 Updated 6 months ago
- Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024. ☆116 Jun 13, 2024 Updated last year
- Sparse Autoencoder for Mechanistic Interpretability ☆292 Jul 20, 2024 Updated last year
- ☆571 Jul 19, 2024 Updated last year
- Algebraic value editing in pretrained language models ☆68 Nov 1, 2023 Updated 2 years ago
- LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces ☆101 Sep 21, 2023 Updated 2 years ago
- ☆284 Mar 2, 2024 Updated last year
- HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal ☆864 Aug 16, 2024 Updated last year
- Stanford NLP Python library for Representation Finetuning (ReFT) ☆1,558 Jan 14, 2026 Updated last month
- ☆117 Feb 11, 2025 Updated last year
- Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction". ☆351 Jun 13, 2025 Updated 8 months ago
- Sparse Autoencoder Training Library ☆55 May 1, 2025 Updated 10 months ago
- ☆1,072 Mar 6, 2024 Updated last year
- Erasing concepts from neural representations with provable guarantees ☆243 Jan 27, 2025 Updated last year
- Universal and Transferable Attacks on Aligned Language Models ☆4,521 Aug 2, 2024 Updated last year
- Robust recipes to align language models with human and AI preferences ☆5,506 Sep 8, 2025 Updated 5 months ago
- ☆140 Aug 4, 2024 Updated last year
- ☆100 Aug 8, 2024 Updated last year
- We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20… ☆341 Feb 23, 2024 Updated 2 years ago
- Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks [ICLR 2025] ☆377 Jan 23, 2025 Updated last year
- ☆396 Aug 21, 2025 Updated 6 months ago
- ☆4,110 Jun 4, 2024 Updated last year
- ☆196 Nov 26, 2023 Updated 2 years ago
- Sparsify transformers with SAEs and transcoders ☆696 Feb 23, 2026 Updated last week