Jiaxin-Wen/MisleadLM

Jiaxin-Wen / MisleadLM

Official Code for our paper: "Language Models Learn to Mislead Humans via RLHF"

☆20

Alternatives and similar repositories for MisleadLM

Users who are interested in MisleadLM are comparing it to the libraries listed below.

callummcdougall / sae_visualizer
☆31 · Apr 4, 2024 · Updated 2 years ago
raybears / cot-transparency
Improving transparency of large language models' reasoning
☆15 · Nov 25, 2025 · Updated 8 months ago
rgreenblatt / control-evaluations
☆25 · May 25, 2024 · Updated 2 years ago
marcus-jw / Targeted-Manipulation-and-Deception-in-LLMs
Codebase for "On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback". This repo implements a generative multi-tur…
☆25 · Dec 3, 2024 · Updated last year
mishajw / repeng
Experiments with representation engineering
☆14 · Feb 28, 2024 · Updated 2 years ago
jplhughes / dotfiles
Easily deploy my zsh and tmux configuration on new machines. Includes local and remote aliases to improve workflow.
☆15 · Apr 23, 2026 · Updated 3 months ago
anthropics / sycophancy-to-subterfuge-paper
☆28 · Sep 5, 2024 · Updated last year
koayon / atp_star
PyTorch and NNsight implementation of AtP* (Kramar et al 2024, DeepMind)
☆20 · Jan 19, 2025 · Updated last year
tripos-education / maths-tripos-questions
Archive of questions from the Cambridge Mathematics Tripos
☆10 · Jun 6, 2022 · Updated 4 years ago
simple-stories / simple_stories_train
Trains small LMs. Designed for training on SimpleStories
☆14 · Sep 15, 2025 · Updated 10 months ago
slavachalnev / SAE-TS
Improving Steering Vectors by Targeting Sparse Autoencoder Features
☆29 · Nov 20, 2024 · Updated last year
AstroJays-Hopkins / Retired-Avionics
Recovery and Propulsion control and monitoring
☆11 · May 15, 2022 · Updated 4 years ago
rgreenblatt / model_organism_public
☆15 · Jun 17, 2025 · Updated last year
HumanCompatibleAI / overcooked-hAI-exp
Overcooked-AI Experiment Psiturk Demo (for MTurk experiments)
☆13 · May 10, 2021 · Updated 5 years ago
jcmgray / einsum_bmm
einsum via batch matrix multiply
☆15 · Nov 29, 2023 · Updated 2 years ago
saprmarks / geometry-of-truth
☆114 · Aug 8, 2024 · Updated last year
alan-cooney / transformer-lens-starter-template
A quick way to get started with Transformer Lens
☆14 · Dec 13, 2023 · Updated 2 years ago
CornellDataScience / FiggieBot
Creating a game to play Figgie & Train an agent to play against
☆15 · Dec 3, 2022 · Updated 3 years ago
safety-research / safety-tooling
Inference API for many LLMs and other useful tools for empirical research
☆134 · May 29, 2026 · Updated 2 months ago
FarnoushRJ / RelP
[NeurIPS 2025 MechInterp Workshop - Spotlight] Official implementation of the paper "RelP: Faithful and Efficient Circuit Discovery in La…
☆29 · Nov 3, 2025 · Updated 8 months ago
jettjaniak / chainscope
Repository for the "Chain-of-Thought Reasoning In The Wild Is Not Always Faithful" paper
☆35 · Mar 31, 2026 · Updated 3 months ago
callummcdougall / sae_vis
Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
☆268 · Feb 27, 2026 · Updated 5 months ago
wyu-du / Controlled-Dialogue-Generation
This repository contains the data and code for the paper "SideControl: Controlled Open-domain Dialogue Generation via Additive Side Netwo…
☆12 · Dec 1, 2021 · Updated 4 years ago
siyan-sylvia-li / arxivParser
☆18 · Sep 21, 2023 · Updated 2 years ago
ApolloResearch / sample
Repository with sample code using Apollo's suggested engineering practices
☆15 · Dec 16, 2024 · Updated last year
ejnnr / cupbearer
A library for mechanistic anomaly detection
☆22 · Jan 9, 2025 · Updated last year
facebookresearch / jailbreak-objectives
Code and data to go with the Zhu et al. paper "An Objective for Nuanced LLM Jailbreaks"
☆37 · Jul 2, 2026 · Updated 3 weeks ago
comp-journalism / list-of-algorithm-audits
A list of algorithm audit studies - now searchable and filterable!
☆19 · Apr 4, 2026 · Updated 3 months ago
alexander-turner / TurnTrout.com
A blog on AI, personal development, and living a good life.
☆46 · Updated this week
amack315 / unsupervised-steering-vectors
☆38 · Apr 30, 2024 · Updated 2 years ago
angie-chen55 / pref-learning-ranking-acc
☆13 · Jun 4, 2024 · Updated 2 years ago
safety-research / safety-examples
☆31 · Nov 11, 2025 · Updated 8 months ago
y-z-zhang / SBD
A simple algorithm that finds a simultaneous block diagonalization of multiple matrices through the eigendecomposition of a single matrix…
☆16 · Feb 24, 2026 · Updated 5 months ago
mit-ccc / acl-nuse-personal-narratives
Exploring aspects of similarity between spoken personal narratives by disentangling them into narrative clause types -- Supplementary inf…
☆12 · Jul 14, 2020 · Updated 6 years ago
alignedai / HappyFaces
The Happy Faces Benchmark
☆15 · Jul 20, 2023 · Updated 3 years ago
allenai / noncompliance
This repository contains data, code and models for contextual noncompliance.
☆26 · Jul 18, 2024 · Updated 2 years ago
shadowkiller33 / Contrast-Instruction
☆19 · Oct 2, 2023 · Updated 2 years ago
zijwang / talkdown
Dataset and pre-trained model of EMNLP-IJCNLP 2019 paper "TalkDown: A Corpus for Condescension Detection in Context."
☆10 · Jan 26, 2020 · Updated 6 years ago
Delineo-Disease-Modeling / PandemicModel
Repository for Delineo Disease Modeling at Johns Hopkins University
☆18 · May 10, 2023 · Updated 3 years ago