keven980716 / weak-to-strong-deception

[ICLR 2025] Code and data for the paper "Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization"
14 | Jun 21, 2024 | Updated last year

Alternatives and similar repositories for weak-to-strong-deception

Users who are interested in weak-to-strong-deception compare it to the libraries listed below
