marcus-jw / Targeted-Manipulation-and-Deception-in-LLMs

Codebase for "On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback". This repo implements a generative multi-turn RL environment with support for agent, user, user feedback, transition and veto models. It also implements KTO and expert iteration for training on user preferences.
15 · Updated 5 months ago

Alternatives and similar repositories for Targeted-Manipulation-and-Deception-in-LLMs

Users who are interested in Targeted-Manipulation-and-Deception-in-LLMs are comparing it to the libraries listed below.
