keven980716 / weak-to-strong-deception

Code and data for the paper "Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization"
10 · Updated 4 months ago

Related projects

Alternatives and complementary repositories for weak-to-strong-deception