anthropics/sycophancy-to-subterfuge-paper


☆28

Alternatives and similar repositories for sycophancy-to-subterfuge-paper

Users who are interested in sycophancy-to-subterfuge-paper are comparing it to the repositories listed below.

anthropics / rogue-deploy-eval
☆16 · Jan 21, 2025 · Updated last year
keing1 / reward-hack-generalization
Datasets used in the paper "Reward hacking behavior can generalize across tasks"
☆15 · Aug 17, 2025 · Updated 11 months ago
anthropics / hypercorn
Hypercorn is an ASGI and WSGI Server based on Hyper libraries and inspired by Gunicorn.
☆21 · Jan 12, 2026 · Updated 6 months ago
Jiaxin-Wen / MisleadLM
Official Code for our paper: "Language Models Learn to Mislead Humans via RLHF"
☆20 · Oct 11, 2024 · Updated last year
shauli-ravfogel / adv-kernel-removal
☆12 · Oct 23, 2022 · Updated 3 years ago
ApolloResearch / e2e_sae
Sparse Autoencoder Training Library
☆58 · May 1, 2025 · Updated last year
anthropics / sleeper-agents-paper
Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training".
☆150 · Mar 9, 2024 · Updated 2 years ago
tim-hua-01 / steering-eval-awareness-public
☆17 · Mar 16, 2026 · Updated 4 months ago
milesaturpin / cot-unfaithfulness
☆57 · Oct 23, 2023 · Updated 2 years ago
anthropics / toy-models-of-superposition
Notebooks accompanying Anthropic's "Toy Models of Superposition" paper
☆157 · Sep 14, 2022 · Updated 3 years ago
slavachalnev / SAE-TS
Improving Steering Vectors by Targeting Sparse Autoencoder Features
☆29 · Nov 20, 2024 · Updated last year
noanabeshima / tinymodel
A TinyStories LM with SAEs and transcoders
☆14 · Apr 3, 2025 · Updated last year
apple / pkl-package-docs
Documentation for Pkl packages
☆18 · Updated this week
anthropics / anthropic-bedrock-python
☆59 · Feb 13, 2024 · Updated 2 years ago
QwenLM / ConsisEval
☆14 · Jul 5, 2024 · Updated 2 years ago
anthropics / anthropic-bedrock-typescript
☆29 · Feb 14, 2024 · Updated 2 years ago
ApolloResearch / apd
Attribution-based Parameter Decomposition
☆35 · Jun 11, 2025 · Updated last year
alextamkin / active-learning-pretrained-models
Active Learning Helps Pretrained Models Learn the Intended Task (https://arxiv.org/abs/2204.08491) by Alex Tamkin, Dat Nguyen, Salil Desh…
☆11 · Nov 22, 2022 · Updated 3 years ago
redwoodresearch / alignment_faking_public
☆95 · Oct 8, 2025 · Updated 9 months ago
oxfordinternetinstitute / oxonfair
Fairness toolkit for pytorch, scikit learn and autogluon
☆33 · Jul 17, 2026 · Updated last week
facebookresearch / DIG-In
This library supports evaluating disparities in generated image quality, diversity, and consistency between geographic regions.
☆20 · Jun 3, 2024 · Updated 2 years ago
neelnanda-io / 1L-Sparse-Autoencoder
☆141 · Oct 28, 2023 · Updated 2 years ago
safety-research / believe-it-or-not
Code and data for editing model beliefs with SDF and other methods, and for evaluating the depth of the implanted beliefs.
☆16 · Oct 23, 2025 · Updated 9 months ago
JCocola / weird-generalization-and-inductive-backdoors
Code and materials for "Weird Generalization and Inductive Backdoors"
☆41 · Jan 11, 2026 · Updated 6 months ago
tokeron / DiffusionLens
☆16 · Jan 30, 2025 · Updated last year
SakanaAI / CycleQD
CycleQD is a framework for parameter space model merging.
☆48 · Feb 1, 2025 · Updated last year
callummcdougall / sae_vis
Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
☆268 · Feb 27, 2026 · Updated 5 months ago
peterbhase / LAS-NL-Explanations
Code for paper "Leakage-Adjusted Simulatability: Can Models Generate Non-Trivial Explanations of Their Behavior in Natural Language?"
☆21 · Oct 13, 2020 · Updated 5 years ago
anthropics / evals
☆415 · Jul 2, 2024 · Updated 2 years ago
anthropics / orjson
Fast, correct Python JSON library supporting dataclasses, datetimes, and numpy
☆59 · May 5, 2026 · Updated 2 months ago
facebookresearch / decrypto
Implementation of the Decrypto benchmark for multi-agent reasoning and theory of mind.
☆22 · Jan 19, 2026 · Updated 6 months ago
huggingface / ioi
☆42 · Mar 26, 2025 · Updated last year
keven980716 / weak-to-strong-deception
[ICLR 2025] Code&Data for the paper "Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization"
☆15 · Jun 21, 2024 · Updated 2 years ago
twitter-research / lmsoc
Code for reproducing our paper: LMSOC: An Approach for Socially Sensitive Pretraining
☆13 · Oct 22, 2021 · Updated 4 years ago
electron / github-app-auth
Gets an auth token for a repo via a GitHub app installation
☆16 · Updated this week
safety-research / false-facts
☆51 · Jul 4, 2025 · Updated last year
gallais / agdARGS
Dealing with Flags and Options
☆13 · Sep 10, 2021 · Updated 4 years ago
LRudL / sad
Situational Awareness Dataset
☆52 · Dec 14, 2024 · Updated last year
alexrs / herd
Mixture of Expert (MoE) techniques for enhancing LLM performance through expert-driven prompt mapping and adapter combinations.
☆11 · Feb 11, 2024 · Updated 2 years ago