HumanCompatibleAI / interpreting-rewards

Experiments in applying interpretability techniques to learned reward functions.
9 · Updated 3 years ago

Related projects

Alternatives and complementary repositories for interpreting-rewards