andyrdt / refusal_direction

Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".

☆290

Alternatives and similar repositories for refusal_direction

Users who are interested in refusal_direction are comparing it to the libraries listed below.

GraySwanAI / circuit-breakers
Improving Alignment and Robustness with Circuit Breakers
☆238 Updated last year
nrimsky / CAA
Steering Llama 2 with Contrastive Activation Addition
☆188 Updated last year
steering-vectors / steering-vectors
Steering vectors for transformer language models in PyTorch / Hugging Face
☆126 Updated 8 months ago
nrimsky / LM-exp
LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces
☆98 Updated 2 years ago
jacobdunefsky / transcoder_circuits
☆181 Updated 11 months ago
EleutherAI / delphi
Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models …
☆219 Updated this week
OpenMOSS / Language-Model-SAEs
For OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research.
☆157 Updated this week
IBM / activation-steering
[ICLR 2025] General-purpose activation steering library
☆111 Updated last month
TransluceAI / observatory
A toolkit for describing model features and intervening on those features to steer behavior.
☆209 Updated 11 months ago
paul-rottger / xstest
Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models"
☆116 Updated 8 months ago
HoagyC / sparse_coding
Using sparse coding to find distributed representations used by neural networks.
☆280 Updated last year
chrisliu298 / awesome-representation-engineering
A resource repository for representation engineering in large language models
☆138 Updated 11 months ago
davidbau / baukit
☆234 Updated last year
stanfordnlp / axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
☆136 Updated 4 months ago
saprmarks / feature-circuits
☆191 Updated last week
safety-research / persona_vectors
Persona Vectors: Monitoring and Controlling Character Traits in Language Models
☆266 Updated 2 months ago
callummcdougall / sae_vis
Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
☆222 Updated 10 months ago
tonychenxyz / selfie
This repository contains the code and data for the paper "SelfIE: Self-Interpretation of Large Language Model Embeddings" by Haozhe Chen,…
☆52 Updated 10 months ago
centerforaisafety / wmdp
WMDP is an LLM proxy benchmark for hazardous knowledge in bio, cyber, and chemical security. We also release code for RMU, an unlearning m…
☆146 Updated 4 months ago
adamkarvonen / SAEBench
☆131 Updated last week
shengliu66 / ICV
Code for In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering
☆190 Updated 8 months ago
ai-safety-foundation / sparse_autoencoder
Sparse Autoencoder for Mechanistic Interpretability
☆276 Updated last year
WildEval / ZeroEval
A simple unified framework for evaluating LLMs
☆251 Updated 6 months ago
Dakingrai / awesome-mechanistic-interpretability-lm-papers
☆206 Updated 11 months ago
da03 / Internalize_CoT_Step_by_Step
☆195 Updated 6 months ago
openai / sparse_autoencoder
☆530 Updated last year
ucl-dark / llm_debate
Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers"
☆117 Updated last year
cooperleong00 / Awesome-LLM-Interpretability
A curated list of LLM Interpretability related material - Tutorial, Library, Survey, Paper, Blog, etc.
☆273 Updated 7 months ago
LLM-Tuning-Safety / LLMs-Finetuning-Safety
We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20…
☆326 Updated last year
dsbowen / strong_reject
☆102 Updated 3 months ago