marcus-jw / Targeted-Manipulation-and-Deception-in-LLMs

Codebase for "On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback". This repo implements a generative multi-turn RL environment with support for agent, user, user feedback, transition and veto models. It also implements KTO and expert iteration for training on user preferences.
12 · Updated last month
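
The description above names five model roles (agent, user, user feedback, transition, veto) in a multi-turn environment. As a rough illustration only, the sketch below shows one way such roles could interact in a single rollout, followed by the kind of top-k selection step that expert iteration typically uses. Every name here (`Trajectory`, `rollout`, `select_for_expert_iteration`), the order of the calls, and the treatment of each model as a plain Python callable are assumptions made for this example, not the repository's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Message = Tuple[str, str]  # (role, text); hypothetical representation


@dataclass
class Trajectory:
    messages: List[Message] = field(default_factory=list)
    ratings: List[float] = field(default_factory=list)
    vetoed: bool = False


def rollout(
    agent: Callable[[List[Message]], str],
    user: Callable[[List[Message]], str],
    feedback: Callable[[List[Message]], float],
    transition: Callable[[List[Message]], bool],
    veto: Callable[[List[Message]], bool],
    first_message: str,
    max_turns: int = 5,
) -> Trajectory:
    """Run one multi-turn conversation (assumed control flow, not the repo's)."""
    traj = Trajectory(messages=[("user", first_message)])
    for _ in range(max_turns):
        # The agent replies to the conversation so far.
        traj.messages.append(("agent", agent(traj.messages)))
        # The veto model may flag the trajectory as unacceptable.
        if veto(traj.messages):
            traj.vetoed = True
            break
        # The feedback model stands in for the user's rating of the reply.
        traj.ratings.append(feedback(traj.messages))
        # The transition model decides whether the conversation continues.
        if not transition(traj.messages):
            break
        # The simulated user writes the next message.
        traj.messages.append(("user", user(traj.messages)))
    return traj


def select_for_expert_iteration(trajs: List[Trajectory], top_k: int) -> List[Trajectory]:
    """Keep the top_k non-vetoed trajectories by mean rating."""
    kept = [t for t in trajs if not t.vetoed and t.ratings]
    kept.sort(key=lambda t: sum(t.ratings) / len(t.ratings), reverse=True)
    return kept[:top_k]
```

In a setup like this, the selected trajectories would then serve as fine-tuning data for the next iteration; for KTO, the same ratings could instead be thresholded into desirable and undesirable examples. Both uses are stated here as assumptions about a typical pipeline.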

Alternatives and similar repositories for Targeted-Manipulation-and-Deception-in-LLMs:

Users who are interested in Targeted-Manipulation-and-Deception-in-LLMs are comparing it to the libraries listed below.