andyrdt / refusal_direction
Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".
355 · Jun 13, 2025 · Updated 8 months ago

Alternatives and similar repositories for refusal_direction

Users who are interested in refusal_direction compare it to the libraries listed below.
