marcus-jw / Targeted-Manipulation-and-Deception-in-LLMs
Codebase for "On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback". This repo implements a generative multi-turn RL environment with support for agent, user, user feedback, transition and veto models. It also implements KTO and expert iteration for training on user preferences.
☆23Updated last year
Alternatives and similar repositories for Targeted-Manipulation-and-Deception-in-LLMs
Users who are interested in Targeted-Manipulation-and-Deception-in-LLMs are comparing it to the libraries listed below
- A TinyStories LM with SAEs and transcoders☆14Updated 9 months ago
- Code repo for the model organisms and convergent directions of EM papers.☆42Updated 3 months ago
- Learning from preferences is a common paradigm for fine-tuning language models. Yet, many algorithmic design decisions come into play. Ou…☆32Updated last year
- Sparse Autoencoder Training Library☆56Updated 8 months ago
- Multi-Agent Verification: Scaling Test-Time Compute with Multiple Verifiers☆26Updated 10 months ago
- ☆32Updated 6 months ago
- Official implementation of Bootstrapping Language Models via DPO Implicit Rewards☆47Updated 9 months ago
- Code for reproducing our paper "Not All Language Model Features Are Linear"☆83Updated last year
- A library for efficient patching and automatic circuit discovery.☆85Updated 3 weeks ago
- ☆98Updated last year
- PostTrainBench measures how well CLI agents like Claude Code or Codex CLI can post-train base LLMs on a single H100 GPU in 10 hours☆114Updated this week
- Code for "Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining"☆26Updated 3 months ago
- Code for the "Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning" paper.☆16Updated 2 months ago
- ☆99Updated last year
- Official Code for our paper: "Language Models Learn to Mislead Humans via RLHF"☆18Updated last year
- ☆22Updated 7 months ago
- ☆137Updated last year
- Code for "Reasoning to Learn from Latent Thoughts"☆124Updated 9 months ago
- Dataset and benchmark for assessing LLMs in translating natural language descriptions of planning problems into PDDL☆63Updated last year
- Directional Preference Alignment☆58Updated last year
- Sotopia-π: Interactive Learning of Socially Intelligent Language Agents (ACL 2024)☆81Updated last year
- CausalGym: Benchmarking causal interpretability methods on linguistic tasks☆51Updated last year
- ☆107Updated last year
- ☆25Updated 2 years ago
- This repository contains the code used for the experiments in the paper "Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity…☆30Updated 2 months ago
- Multi-Layer Sparse Autoencoders (ICLR 2025)☆28Updated 11 months ago
- ☆28Updated last year
- Sotopia-RL: Reward Design for Social Intelligence☆46Updated 5 months ago
- ☆19Updated 5 months ago
- ☆34Updated 11 months ago