☆87Oct 8, 2025Updated 5 months ago
Alternatives and similar repositories for alignment_faking_public
Users that are interested in alignment_faking_public are comparing it to the libraries listed below
- ☆26Sep 5, 2024Updated last year
- ☆22Sep 9, 2021Updated 4 years ago
- ☆10Feb 25, 2020Updated 6 years ago
- ☆35Feb 20, 2025Updated last year
- Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training".☆134Mar 9, 2024Updated 2 years ago
- ☆20Nov 15, 2024Updated last year
- Sparse Autoencoder Training Library☆55May 1, 2025Updated 10 months ago
- A tiny easily hackable implementation of a feature dashboard.☆15Oct 21, 2025Updated 4 months ago
- Interpreting Learned Search and Planning: Reverse-engineering recurrent convolutional networks (DRC) that play Sokoban☆17Jun 29, 2025Updated 8 months ago
- Measuring the situational awareness of language models☆40Feb 12, 2024Updated 2 years ago
- [COLM 2025] SEAL: Steerable Reasoning Calibration of Large Language Models for Free☆54Apr 6, 2025Updated 11 months ago
- Repository with sample code using Apollo's suggested engineering practices☆15Dec 16, 2024Updated last year
- Tools for studying developmental interpretability in neural networks.☆127Dec 30, 2025Updated 2 months ago
- Official Code for our paper: "Language Models Learn to Mislead Humans via RLHF"☆19Oct 11, 2024Updated last year
- Repository for the "Chain-of-Thought Reasoning In The Wild Is Not Always Faithful" paper☆31Nov 28, 2025Updated 3 months ago
- Automatically create Anki cards from text using language models☆20Jan 7, 2023Updated 3 years ago
- ☆36Apr 30, 2024Updated last year
- ☆18Feb 25, 2026Updated last week
- Code repo for the paper: Attacking Vision-Language Computer Agents via Pop-ups☆51Dec 23, 2024Updated last year
- Implementation of Influence Function approximations for differently sized ML models, using PyTorch☆16Sep 15, 2023Updated 2 years ago
- A library for mechanistic anomaly detection☆22Jan 9, 2025Updated last year
- Improving Alignment and Robustness with Circuit Breakers☆258Sep 24, 2024Updated last year
- PyTorch and NNsight implementation of AtP* (Kramar et al 2024, DeepMind)☆20Jan 19, 2025Updated last year
- Einsum with einops style variable names☆18May 16, 2024Updated last year
- Awesome Large Reasoning Model (LRM) Safety. This repository is used to collect security-related research on large reasoning models such as …☆82Updated this week
- Tools for understanding how transformer predictions are built layer-by-layer☆570Aug 7, 2025Updated 7 months ago
- gpt completions in vscode☆35Mar 24, 2023Updated 2 years ago
- (Model-written) LLM evals library☆18Dec 13, 2024Updated last year
- ☆24Feb 17, 2026Updated 2 weeks ago
- A formalisation of Cartesian Frames, a perspective on embedded agency, in the HOL theorem prover.☆20Dec 20, 2021Updated 4 years ago
- [ACL 25] SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities☆28Apr 2, 2025Updated 11 months ago
- Codebase for "On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback". This repo implements a generative multi-tur…☆23Dec 3, 2024Updated last year
- ☆22Nov 11, 2024Updated last year
- A gym environment for Stuart Armstrong's model of a treacherous turn.☆18Jul 28, 2018Updated 7 years ago
- ☆25Nov 11, 2025Updated 3 months ago
- ☆399Aug 21, 2025Updated 6 months ago
- ControlArena is a collection of settings, model organisms, and protocols for running control experiments.☆160Feb 27, 2026Updated last week
- Code for Continual Reinforcement Learning with Multi-Timescale Replay☆24Apr 16, 2020Updated 5 years ago
- Author's PyTorch implementation of SR-DICE for marginalized importance sampling☆28Dec 7, 2021Updated 4 years ago