marcus-jw / Targeted-Manipulation-and-Deception-in-LLMs
Codebase for "On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback". This repo implements a generative multi-turn RL environment with support for agent, user, user-feedback, transition, and veto models. It also implements KTO (Kahneman-Tversky Optimization) and expert iteration for training on user preferences.
☆12 · Updated last month
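The KTO training mentioned in the description optimizes a per-example loss over binary desirable/undesirable feedback rather than paired preferences. Below is a minimal sketch of that objective, assuming the standard formulation from the KTO paper (Ethayarajh et al., 2024); this is not code from the repo, and the default values for `beta`, `z_ref`, and the `lam_*` weights are illustrative only (in the full algorithm `z_ref` is a KL reference point estimated over a batch, fixed at 0 here for simplicity):

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def kto_loss(policy_logp: float, ref_logp: float, desirable: bool,
             beta: float = 0.1, z_ref: float = 0.0,
             lam_d: float = 1.0, lam_u: float = 1.0) -> float:
    """Per-example KTO loss sketch.

    reward = beta * (log pi_theta(y|x) - log pi_ref(y|x)).
    Desirable completions are pushed above the reference point z_ref,
    undesirable ones below it; lam_d / lam_u weight the two cases.
    """
    reward = beta * (policy_logp - ref_logp)
    if desirable:
        return lam_d * (1.0 - sigmoid(reward - z_ref))
    return lam_u * (1.0 - sigmoid(z_ref - reward))
```

Note the asymmetry: raising the policy log-probability of a desirable completion lowers its loss, while for an undesirable one it raises the loss, so feedback labels steer the policy without needing preference pairs.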
Alternatives and similar repositories for Targeted-Manipulation-and-Deception-in-LLMs:
Users interested in Targeted-Manipulation-and-Deception-in-LLMs are comparing it to the repositories listed below.
- A library for efficient patching and automatic circuit discovery. ☆48 · Updated 2 months ago
- Sparse Autoencoder Training Library ☆39 · Updated 3 months ago
- Open source replication of Anthropic's Crosscoders for Model Diffing ☆31 · Updated 3 months ago
- Code for my NeurIPS 2024 ATTRIB paper titled "Attribution Patching Outperforms Automated Circuit Discovery" ☆28 · Updated 7 months ago
- [NeurIPS'24 Spotlight] Observational Scaling Laws ☆49 · Updated 3 months ago
- Algebraic value editing in pretrained language models ☆62 · Updated last year
- PyTorch and NNsight implementation of AtP* (Kramar et al., 2024, DeepMind) ☆18 · Updated last week
- Code for most of the experiments in the paper "Understanding the Effects of RLHF on LLM Generalisation and Diversity" ☆40 · Updated last year
- A repo for RLHF training and best-of-n (BoN) sampling over LLMs, with support for reward-model ensembles. ☆35 · Updated 2 weeks ago
- Sparse probing paper full code. ☆54 · Updated last year
- Universal Neurons in GPT2 Language Models ☆27 · Updated 8 months ago
- Official implementation of the Reward rAnked Fine-Tuning algorithm (RAFT), also known as iterative best-of-n fine-tuning or re… ☆22 · Updated 4 months ago
- Rewarded Soups official implementation ☆55 · Updated last year
- Self-Supervised Alignment with Mutual Information ☆16 · Updated 8 months ago
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity ☆62 · Updated 2 months ago
- Code for the safety tests in "Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates" ☆17 · Updated 10 months ago
- Code for reproducing the authors' paper "Not All Language Model Features Are Linear" ☆66 · Updated 2 months ago
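The expert iteration supported by the headline repo (and the iterative best-of-n fine-tuning variants listed above, such as RAFT) shares a common skeleton: sample several candidates per prompt, keep the highest-scoring one under a feedback or reward model, and fine-tune on the winners. A toy sketch, with `generate`, `score`, and `train` as caller-supplied stand-ins for the policy sampler, feedback model, and SFT step (none of these names come from the repo):

```python
import random


def expert_iteration(prompts, generate, score, train,
                     n_samples=8, rounds=3):
    """Toy expert-iteration loop.

    Each round: draw n_samples candidates per prompt, keep the one the
    scorer ranks highest, then hand the (prompt, winner) pairs to the
    training step. Returns the winners from the final round.
    """
    winners = []
    for _ in range(rounds):
        winners = []
        for p in prompts:
            candidates = [generate(p) for _ in range(n_samples)]
            best = max(candidates, key=lambda c: score(p, c))
            winners.append((p, best))
        train(winners)  # e.g. one SFT epoch on the winning completions
    return winners
```

In a real setup `generate` would sample from the current policy (so later rounds draw from an improved model), which is what distinguishes iterated expert iteration from one-shot best-of-n distillation.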