☆92Oct 8, 2025Updated 6 months ago
Alternatives and similar repositories for alignment_faking_public
Users that are interested in alignment_faking_public are comparing it to the libraries listed below.
- ☆28Sep 5, 2024Updated last year
- ☆35Feb 20, 2025Updated last year
- A quick way to get started with Transformer Lens☆14Dec 13, 2023Updated 2 years ago
- This git repository outlines three scoring rules that I believe might serve current forecasting platforms better than current alternative…☆18Apr 18, 2022Updated 3 years ago
- ☆20Nov 15, 2024Updated last year
- A library for mechanistic anomaly detection☆22Jan 9, 2025Updated last year
- Repository with sample code using Apollo's suggested engineering practices☆15Dec 16, 2024Updated last year
- Repo for paper: Examining LLMs' Uncertainty Expression Towards Questions Outside Parametric Knowledge☆14Feb 20, 2024Updated 2 years ago
- [COLM 2025] SEAL: Steerable Reasoning Calibration of Large Language Models for Free☆55Apr 6, 2025Updated last year
- Official Code for our paper: "Language Models Learn to Mislead Humans via RLHF"☆19Oct 11, 2024Updated last year
- Code repo for the paper: Attacking Vision-Language Computer Agents via Pop-ups☆51Dec 23, 2024Updated last year
- A tiny easily hackable implementation of a feature dashboard.☆16Oct 21, 2025Updated 5 months ago
- ☆13Jul 12, 2024Updated last year
- Improving Alignment and Robustness with Circuit Breakers☆259Sep 24, 2024Updated last year
- ControlArena is a collection of settings, model organisms and protocols - for running control experiments.☆174Mar 31, 2026Updated last week
- ☆36Apr 30, 2024Updated last year
- ☆10Feb 25, 2020Updated 6 years ago
- Trains small LMs. Designed for training on SimpleStories☆12Sep 15, 2025Updated 6 months ago
- ☆25Nov 11, 2025Updated 4 months ago
- [ACL 25] SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities☆29Apr 2, 2025Updated last year
- Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training".☆141Mar 9, 2024Updated 2 years ago
- ☆18Mar 31, 2026Updated last week
- PyTorch and NNsight implementation of AtP* (Kramar et al 2024, DeepMind)☆20Jan 19, 2025Updated last year
- Tools for understanding how transformer predictions are built layer-by-layer☆580Aug 7, 2025Updated 8 months ago
- Repository for the "Chain-of-Thought Reasoning In The Wild Is Not Always Faithful" paper☆32Mar 31, 2026Updated last week
- Situational Awareness Dataset☆49Dec 14, 2024Updated last year
- Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).☆252Feb 27, 2026Updated last month
- ☆407Aug 21, 2025Updated 7 months ago
- ☆29Apr 4, 2024Updated 2 years ago
- Awesome Large Reasoning Model (LRM) Safety. This repository is used to collect security-related research on large reasoning models such as …☆82Apr 3, 2026Updated last week
- Codebase for "On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback". This repo implements a generative multi-tur…☆25Dec 3, 2024Updated last year
- (Model-written) LLM evals library☆18Dec 13, 2024Updated last year
- Code for Preventing Language Models From Hiding Their Reasoning, which evaluates defenses against LLM steganography.☆25Jan 26, 2024Updated 2 years ago
- ☆277Jan 12, 2026Updated 2 months ago
- ☆22Sep 9, 2021Updated 4 years ago
- ☆58Nov 19, 2024Updated last year
- A TinyStories LM with SAEs and transcoders☆14Apr 3, 2025Updated last year
- An active inference model of Lacanian psychoanalysis☆16Jun 7, 2025Updated 10 months ago
- Interpreting Learned Search and Planning: Reverse-engineering recurrent convolutional networks (DRC) that play Sokoban☆18Jun 29, 2025Updated 9 months ago