andyrdt / refusal_direction

Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".
233 · Updated last week

Alternatives and similar repositories for refusal_direction

Users who are interested in refusal_direction are comparing it to the libraries listed below.
