marcus-jw / Targeted-Manipulation-and-Deception-in-LLMs

Codebase for "On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback". This repo implements a generative multi-turn RL environment with support for agent, user, user feedback, transition and veto models. It also implements KTO and expert iteration for training on user preferences.
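The description names five model roles (agent, user, user feedback, transition, veto) and two training methods (KTO, expert iteration). The following is a toy sketch of how those pieces could fit together. It is not the repo's API: `MultiTurnEnv`, `Trajectory` and `select_for_training` are hypothetical names, and plain functions stand in for the LLMs a real run would call. The sketch assumes that the transition model only decides whether the conversation ends, that the veto model screens finished trajectories out of training, and that trajectories are ranked by mean per-turn feedback. Under those assumptions, the top slice is used for expert iteration, and the top and bottom slices serve as KTO desirable and undesirable examples.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Message = Tuple[str, str]  # (speaker, text)
History = List[Message]


@dataclass
class Trajectory:
    messages: History = field(default_factory=list)
    feedback: List[float] = field(default_factory=list)  # one rating per agent turn
    vetoed: bool = False

    @property
    def score(self) -> float:
        # Mean per-turn feedback; 0.0 for a trajectory with no ratings.
        return sum(self.feedback) / len(self.feedback) if self.feedback else 0.0


@dataclass
class MultiTurnEnv:
    # Hypothetical callables over the conversation so far (LLM calls in practice).
    agent: Callable[[History], str]        # policy being trained
    user: Callable[[History], str]         # simulated user's next message
    feedback: Callable[[History], float]   # simulated user's rating of the last agent turn
    transition: Callable[[History], bool]  # assumed: True -> conversation ends
    veto: Callable[[History], bool]        # assumed: True -> excluded from training
    max_turns: int = 3

    def rollout(self, opening: str) -> Trajectory:
        traj = Trajectory(messages=[("user", opening)])
        for turn in range(self.max_turns):
            traj.messages.append(("agent", self.agent(traj.messages)))
            traj.feedback.append(self.feedback(traj.messages))
            if turn == self.max_turns - 1 or self.transition(traj.messages):
                break
            traj.messages.append(("user", self.user(traj.messages)))
        traj.vetoed = self.veto(traj.messages)  # screen the finished conversation
        return traj


def select_for_training(
    trajs: List[Trajectory], frac: float = 0.25
) -> Dict[str, List[Trajectory]]:
    # Drop vetoed trajectories, then rank the rest by mean feedback (best first).
    kept = sorted((t for t in trajs if not t.vetoed), key=lambda t: t.score, reverse=True)
    k = int(len(kept) * frac)
    if k == 0:
        return {"expert_iteration": [], "kto_desirable": [], "kto_undesirable": []}
    return {
        "expert_iteration": kept[:k],  # fine-tune only on the top slice
        "kto_desirable": kept[:k],     # KTO positives
        "kto_undesirable": kept[-k:],  # KTO negatives
    }


if __name__ == "__main__":
    rng = random.Random(0)
    env = MultiTurnEnv(
        agent=lambda h: "Here is some advice.",
        user=lambda h: "Thanks, anything else?",
        feedback=lambda h: rng.uniform(1, 10),  # toy rating, not a real model
        transition=lambda h: len(h) >= 5,        # toy rule: end after a few exchanges
        veto=lambda h: False,                    # toy rule: veto nothing
    )
    trajs = [env.rollout("I need help with my diet.") for _ in range(8)]
    sel = select_for_training(trajs)
    print({name: len(ts) for name, ts in sel.items()})
    # prints {'expert_iteration': 2, 'kto_desirable': 2, 'kto_undesirable': 2}
```

Keeping the veto as a filter applied before ranking, rather than as a penalty on the score, means a vetoed trajectory can never become a training target, however highly the simulated user rated it.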
13 stars · Updated 3 months ago

Alternatives and similar repositories for Targeted-Manipulation-and-Deception-in-LLMs:

Users who are interested in Targeted-Manipulation-and-Deception-in-LLMs are comparing it to the libraries listed below.