andyrdt / refusal_direction

Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".
223 · Updated 8 months ago

Alternatives and similar repositories for refusal_direction

Users who are interested in refusal_direction compare it to the libraries listed below.
