anthropics / sycophancy-to-subterfuge-paper
☆25Sep 5, 2024Updated last year
Alternatives and similar repositories for sycophancy-to-subterfuge-paper
Users who are interested in sycophancy-to-subterfuge-paper are comparing it to the repositories listed below
- Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training".☆129Mar 9, 2024Updated last year
- Display and customize Markdown text in SwiftUI☆34Jan 28, 2025Updated last year
- ☆12Oct 23, 2022Updated 3 years ago
- Documentation for Pkl packages☆17Feb 4, 2026Updated 2 weeks ago
- Notebooks accompanying Anthropic's "Toy Models of Superposition" paper☆135Sep 14, 2022Updated 3 years ago
- Situational Awareness Dataset☆43Dec 14, 2024Updated last year
- A TinyStories LM with SAEs and transcoders☆14Apr 3, 2025Updated 10 months ago
- Official code for our paper: "Language Models Learn to Mislead Humans via RLHF"☆18Oct 11, 2024Updated last year
- ☆51Oct 23, 2023Updated 2 years ago
- ☆36Apr 30, 2024Updated last year
- ☆83Oct 8, 2025Updated 4 months ago
- This library supports evaluating disparities in generated image quality, diversity, and consistency between geographic regions.☆20Jun 3, 2024Updated last year
- Codebase for "On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback". This repo implements a generative multi-tur…☆23Dec 3, 2024Updated last year
- ☆25Feb 6, 2026Updated last week
- ☆26Feb 14, 2024Updated 2 years ago
- ☆329Jul 2, 2024Updated last year
- ☆29Jul 17, 2023Updated 2 years ago
- ☆346Nov 4, 2024Updated last year
- ☆27Oct 6, 2024Updated last year
- ☆32Nov 20, 2025Updated 2 months ago
- Service for quickly aliasing and redirecting to long URLs☆24Apr 26, 2023Updated 2 years ago
- Open source replication of Anthropic's Crosscoders for Model Diffing☆64Oct 27, 2024Updated last year
- A curated list of open-source projects related to MoonshotCoder.☆34May 22, 2024Updated last year
- ☆44Jan 17, 2026Updated last month
- Get up and running with the Gemini API using a simple Journaling App + Angular☆48Updated this week
- Synthetic data derived by templating, few shot prompting, transformations on public domain corpora, and monte carlo tree search.☆32Oct 8, 2025Updated 4 months ago
- Auditing agents for fine-tuning safety☆18Oct 21, 2025Updated 3 months ago
- Chrome Extension that replaces the need for aws-google-auth☆13Sep 21, 2025Updated 4 months ago
- Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).☆243Dec 16, 2024Updated last year
- ☆131Oct 28, 2023Updated 2 years ago
- Code and data for paper "Context-faithful Prompting for Large Language Models".☆42Mar 23, 2023Updated 2 years ago
- MiniMax-Provider-Verifier offers a rigorous, vendor-agnostic way to verify whether third-party deployments of the Minimax M2 model are co…☆23Jan 15, 2026Updated last month
- OLMost every training recipe you need to perform data interventions with the OLMo family of models.☆64Updated this week
- Tusk Drift Demo - Node.js Service☆58Jan 20, 2026Updated 3 weeks ago
- Sample ERC1155☆10Mar 26, 2022Updated 3 years ago
- ☆10Dec 24, 2021Updated 4 years ago
- AI-Rag-ChatBot is a complete project example with RAGChat and Next.js 14, using Upstash Vector Database, Upstash Qstash, Upstash Redis, D…☆13Jul 10, 2025Updated 7 months ago
- ☆16Updated this week
- ☆17Aug 5, 2025Updated 6 months ago