andyrdt / refusal_direction

Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".
117 · Updated last month

Related projects

Alternatives and complementary repositories for refusal_direction