marcus-jw / Targeted-Manipulation-and-Deception-in-LLMs
Codebase for "On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback". This repo implements a generative multi-turn RL environment with support for agent, user, user-feedback, transition, and veto models. It also implements KTO and expert iteration for training on user preferences.
☆22 · Updated 11 months ago
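To make the description above concrete, here is a minimal sketch of how a multi-turn environment with agent, user, feedback, and veto models might fit together. This is not the repo's actual API; every name below (`Trajectory`, `rollout`, the stand-in model callables) is a hypothetical illustration of the roles listed in the description.

```python
# Hypothetical sketch only; names and structure do not mirror the repo's actual API.
from dataclasses import dataclass, field
from typing import Callable, List

# Each "model" is stubbed as a callable from the conversation history to an output.
AgentModel = Callable[[List[str]], str]      # produces the agent's reply
UserModel = Callable[[List[str]], str]       # simulates the user's next message
FeedbackModel = Callable[[List[str]], float] # simulated user feedback as a scalar reward
VetoModel = Callable[[List[str]], bool]      # flags trajectories that should be discarded


@dataclass
class Trajectory:
    turns: List[str] = field(default_factory=list)
    reward: float = 0.0
    vetoed: bool = False


def rollout(agent: AgentModel, user: UserModel, feedback: FeedbackModel,
            veto: VetoModel, num_turns: int = 3) -> Trajectory:
    """Run one multi-turn conversation and score it with the feedback model."""
    traj = Trajectory()
    for _ in range(num_turns):
        traj.turns.append("user: " + user(traj.turns))
        traj.turns.append("agent: " + agent(traj.turns))
    if veto(traj.turns):
        # A veto model can filter out problematic trajectories before training.
        traj.vetoed = True
        return traj
    traj.reward = feedback(traj.turns)
    return traj


if __name__ == "__main__":
    # Trivial stand-ins so the sketch runs end to end.
    traj = rollout(
        agent=lambda h: "Sure, I can help with that.",
        user=lambda h: "Can you book me a flight?",
        feedback=lambda h: 1.0,
        veto=lambda h: False,
    )
    print(traj.reward, traj.vetoed, len(traj.turns))
```

Trajectories collected this way could then feed a preference-based trainer such as KTO or an expert-iteration loop, as the description notes.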
Alternatives and similar repositories for Targeted-Manipulation-and-Deception-in-LLMs
Users interested in Targeted-Manipulation-and-Deception-in-LLMs are comparing it to the repositories listed below.
- Official Code for our paper: "Language Models Learn to Mislead Humans via RLHF" ☆18 · Updated last year
- A library for efficient patching and automatic circuit discovery. ☆80 · Updated 4 months ago
- Code repo for the model organisms and convergent directions of EM papers. ☆36 · Updated last month
- A TinyStories LM with SAEs and transcoders ☆13 · Updated 7 months ago
- ☆24 · Updated last year
- Sparse Autoencoder Training Library ☆55 · Updated 6 months ago
- ☆94 · Updated last year
- ☆23 · Updated last year
- ☆16 · Updated last year
- Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e… ☆28 · Updated last year
- Code for reproducing our paper "Not All Language Model Features Are Linear" ☆84 · Updated 11 months ago
- ☆23 · Updated 9 months ago
- ☆32 · Updated 9 months ago
- Measuring the situational awareness of language models ☆39 · Updated last year
- Code for my NeurIPS 2024 ATTRIB paper titled "Attribution Patching Outperforms Automated Circuit Discovery" ☆42 · Updated last year
- Code for the ICLR 2024 paper "How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions" ☆71 · Updated last year
- ☆129 · Updated last year
- This repository contains the code used for the experiments in the paper "Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity… ☆28 · Updated 3 weeks ago
- Code for "Reasoning to Learn from Latent Thoughts" ☆122 · Updated 7 months ago
- CausalGym: Benchmarking causal interpretability methods on linguistic tasks ☆50 · Updated 11 months ago
- Algebraic value editing in pretrained language models ☆66 · Updated 2 years ago
- Stanford NLP Python library for benchmarking the utility of LLM interpretability methods ☆141 · Updated 4 months ago
- Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers" ☆121 · Updated last year
- Universal Neurons in GPT2 Language Models ☆31 · Updated last year
- This is code for most of the experiments in the paper Understanding the Effects of RLHF on LLM Generalisation and Diversity ☆47 · Updated last year
- Multi-Layer Sparse Autoencoders (ICLR 2025) ☆26 · Updated 9 months ago
- ☆20 · Updated last year
- Gemstones: A Model Suite for Multi-Faceted Scaling Laws (NeurIPS 2025) ☆29 · Updated last month
- Rewarded soups official implementation ☆62 · Updated 2 years ago
- Learning from preferences is a common paradigm for fine-tuning language models. Yet, many algorithmic design decisions come into play. Ou… ☆32 · Updated last year