keven980716 / weak-to-strong-deception

[ICLR 2025] Code and data for the paper "Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization"
13 · Updated 9 months ago

Alternatives and similar repositories for weak-to-strong-deception:

Users who are interested in weak-to-strong-deception are comparing it to the repositories listed below.