chrisliu298 / awesome-representation-engineering
A resource repository for representation engineering in large language models
☆148Nov 14, 2024Updated last year
Alternatives and similar repositories for awesome-representation-engineering
Users that are interested in awesome-representation-engineering are comparing it to the libraries listed below
- Official implementation repository for the paper Towards General Conceptual Model Editing via Adversarial Representation Engineering.☆18Dec 6, 2024Updated last year
- [ICLR 2025] General-purpose activation steering library☆142Sep 18, 2025Updated 4 months ago
- Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization☆42Jul 28, 2024Updated last year
- Steering Llama 2 with Contrastive Activation Addition☆209May 23, 2024Updated last year
- Representation Engineering: A Top-Down Approach to AI Transparency☆947Aug 14, 2024Updated last year
- Steering vectors for transformer language models in PyTorch / Hugging Face☆140Feb 21, 2025Updated 11 months ago
- ☆57Jun 13, 2024Updated last year
- ☆13Feb 24, 2025Updated 11 months ago
- Improving Alignment and Robustness with Circuit Breakers☆258Sep 24, 2024Updated last year
- ☆23Jun 13, 2024Updated last year
- Algebraic value editing in pretrained language models☆68Nov 1, 2023Updated 2 years ago
- [NAACL'25 Oral] Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering☆72Jan 16, 2026Updated 3 weeks ago
- PaCE: Parsimonious Concept Engineering for Large Language Models (NeurIPS 2024)☆42Jan 18, 2026Updated 3 weeks ago
- Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".☆342Jun 13, 2025Updated 8 months ago
- A curated list of resources for activation engineering☆123Oct 2, 2025Updated 4 months ago
- [ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs"☆66Jun 9, 2025Updated 8 months ago
- The GitHub repo for our survey paper: "Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large…☆85Jan 30, 2026Updated 2 weeks ago
- Stanford NLP Python library for understanding and improving PyTorch models via interventions☆858Jan 29, 2026Updated 2 weeks ago
- ☆247Feb 22, 2024Updated last year
- Materials for "Multi-property Steering of Large Language Models with Dynamic Activation Composition"☆14Nov 22, 2024Updated last year
- A curated list of LLM Interpretability related material - Tutorial, Library, Survey, Paper, Blog, etc.☆295Jan 22, 2026Updated 3 weeks ago
- Experiments with representation engineering☆13Feb 28, 2024Updated last year
- Providing the answer to "How to do patching on all available SAEs on GPT-2?". It is an official repository of the implementation of the p…☆12Jan 26, 2025Updated last year
- ☆30Aug 2, 2024Updated last year
- Official repository for ICML 2024 paper "On Prompt-Driven Safeguarding for Large Language Models"☆106May 20, 2025Updated 8 months ago
- Awesome Large Reasoning Model (LRM) Safety. This repository is used to collect security-related research on large reasoning models such as …☆81Feb 6, 2026Updated last week
- [ACL 2024 main] Aligning Large Language Models with Human Preferences through Representation Engineering (https://aclanthology.org/2024.…☆28Sep 25, 2024Updated last year
- ☆28Nov 16, 2025Updated 2 months ago
- Stanford NLP Python library for benchmarking the utility of LLM interpretability methods☆163Jun 25, 2025Updated 7 months ago
- Code release for the paper "Style Vectors for Steering Generative Large Language Models", accepted to the Findings of EACL 2024.☆36Sep 26, 2024Updated last year
- ☆207Oct 14, 2025Updated 4 months ago
- Our research proposes a novel MoGU framework that improves LLMs' safety while preserving their usability.☆18Jan 14, 2025Updated last year
- Code to the paper: The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence☆23Jul 31, 2025Updated 6 months ago
- ☆16Mar 5, 2024Updated last year
- A library for mechanistic anomaly detection☆22Jan 9, 2025Updated last year
- ☆25Nov 28, 2024Updated last year
- TACL 2025: Investigating Adversarial Trigger Transfer in Large Language Models☆19Aug 17, 2025Updated 5 months ago
- A survey on harmful fine-tuning attacks for large language models☆232Jan 9, 2026Updated last month
- A library for making RepE control vectors☆685Sep 24, 2025Updated 4 months ago