A resource repository for representation engineering in large language models
☆148 · Nov 14, 2024 · Updated last year
Alternatives and similar repositories for awesome-representation-engineering
Users who are interested in awesome-representation-engineering are comparing it to the libraries listed below.
- Official implementation repository for the paper Towards General Conceptual Model Editing via Adversarial Representation Engineering. ☆19 · Dec 6, 2024 · Updated last year
- [ICLR 2025] General-purpose activation steering library ☆145 · Sep 18, 2025 · Updated 5 months ago
- Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization ☆42 · Jul 28, 2024 · Updated last year
- Steering Llama 2 with Contrastive Activation Addition ☆213 · May 23, 2024 · Updated last year
- Representation Engineering: A Top-Down Approach to AI Transparency ☆953 · Aug 14, 2024 · Updated last year
- Steering vectors for transformer language models in PyTorch / Hugging Face ☆140 · Feb 21, 2025 · Updated last year
- ☆13 · Feb 24, 2025 · Updated last year
- ☆58 · Jun 13, 2024 · Updated last year
- Improving Alignment and Robustness with Circuit Breakers ☆258 · Sep 24, 2024 · Updated last year
- ☆23 · Jun 13, 2024 · Updated last year
- Algebraic value editing in pretrained language models ☆69 · Nov 1, 2023 · Updated 2 years ago
- [NAACL'25 Oral] Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering ☆76 · Jan 16, 2026 · Updated last month
- PaCE: Parsimonious Concept Engineering for Large Language Models (NeurIPS 2024) ☆42 · Jan 18, 2026 · Updated last month
- Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction". ☆355 · Jun 13, 2025 · Updated 8 months ago
- Official codebase for "Analyzing the Generalization and Reliability of Steering Vectors" ☆19 · Dec 14, 2024 · Updated last year
- A curated list of resources for activation engineering ☆128 · Oct 2, 2025 · Updated 5 months ago
- [ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs" ☆67 · Jun 9, 2025 · Updated 8 months ago
- Stanford NLP Python library for understanding and improving PyTorch models via interventions ☆866 · Updated this week
- Materials for "Multi-property Steering of Large Language Models with Dynamic Activation Composition" ☆14 · Nov 22, 2024 · Updated last year
- ☆250 · Feb 22, 2024 · Updated 2 years ago
- The GitHub repo for our survey paper: "Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large… ☆92 · Jan 30, 2026 · Updated last month
- A curated list of LLM Interpretability related material - Tutorial, Library, Survey, Paper, Blog, etc. ☆294 · Jan 22, 2026 · Updated last month
- Experiments with representation engineering ☆14 · Feb 28, 2024 · Updated 2 years ago
- ☆30 · Aug 2, 2024 · Updated last year
- Official Code Repository for LM-Steer Paper: "Word Embeddings Are Steers for Language Models" (ACL 2024 Outstanding Paper Award) ☆139 · Jul 13, 2025 · Updated 7 months ago
- Official repository for ICML 2024 paper "On Prompt-Driven Safeguarding for Large Language Models" ☆107 · May 20, 2025 · Updated 9 months ago
- Awesome Large Reasoning Model (LRM) Safety. This repository is used to collect security-related research on large reasoning models such as … ☆82 · Feb 27, 2026 · Updated last week
- [ACL 2024 main] Aligning Large Language Models with Human Preferences through Representation Engineering (https://aclanthology.org/2024.… ☆28 · Sep 25, 2024 · Updated last year
- ☆28 · Nov 16, 2025 · Updated 3 months ago
- Stanford NLP Python library for benchmarking the utility of LLM interpretability methods ☆168 · Feb 22, 2026 · Updated last week
- Code release for the paper "Style Vectors for Steering Generative Large Language Models", accepted to the Findings of EACL 2024. ☆36 · Sep 26, 2024 · Updated last year
- ☆209 · Oct 14, 2025 · Updated 4 months ago
- Our research proposes a novel MoGU framework that improves LLMs' safety while preserving their usability. ☆18 · Jan 14, 2025 · Updated last year
- ☆19 · Mar 5, 2024 · Updated 2 years ago
- TACL 2025: Investigating Adversarial Trigger Transfer in Large Language Models ☆19 · Aug 17, 2025 · Updated 6 months ago
- Code for the paper "The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence" ☆26 · Jul 31, 2025 · Updated 7 months ago
- A library for mechanistic anomaly detection ☆22 · Jan 9, 2025 · Updated last year
- A survey on harmful fine-tuning attacks for large language models ☆232 · Feb 25, 2026 · Updated last week
- A library for making RepE control vectors ☆691 · Sep 24, 2025 · Updated 5 months ago