andyrdt / refusal_direction

Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".
153 · Updated 3 months ago

Alternatives and similar repositories for refusal_direction:

Users who are interested in refusal_direction are comparing it to the libraries listed below.