safety-research / open-source-alignment-faking
Open Source Replication of Anthropic's Alignment Faking Paper
☆11 Updated 2 months ago
Alternatives and similar repositories for open-source-alignment-faking
Users who are interested in open-source-alignment-faking are comparing it to the repositories listed below
- Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e… ☆27 Updated last year
- ☆44 Updated last year