anthropics / sleeper-agents-paper
Contains random samples referenced in the paper "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training".
☆102 · Updated last year
Alternatives and similar repositories for sleeper-agents-paper
Users who are interested in sleeper-agents-paper are comparing it to the repositories listed below.
- ☆54 · Updated 7 months ago
- Improving Alignment and Robustness with Circuit Breakers ☆203 · Updated 7 months ago
- Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers" ☆105 · Updated last year
- ☆129 · Updated last month
- Adversarial Attacks on GPT-4 via Simple Random Search [Dec 2023] ☆43 · Updated last year
- [ICLR'24 Spotlight] A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use ☆142 · Updated last year
- Steering vectors for transformer language models in Pytorch / Huggingface ☆100 · Updated 2 months ago
- Collection of evals for Inspect AI ☆132 · Updated this week
- Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024. ☆111 · Updated 11 months ago
- Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives ☆68 · Updated last year
- Code to break Llama Guard ☆31 · Updated last year
- ☆74 · Updated 3 weeks ago
- ☆58 · Updated 4 months ago
- Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction". ☆217 · Updated 7 months ago
- Functional Benchmarks and the Reasoning Gap ☆86 · Updated 7 months ago
- ☆100 · Updated 2 months ago
- WMDP is an LLM proxy benchmark for hazardous knowledge in bio, cyber, and chemical security. We also release code for RMU, an unlearning m… ☆119 · Updated last year
- METR Task Standard ☆146 · Updated 3 months ago
- Datasets from the paper "Towards Understanding Sycophancy in Language Models" ☆75 · Updated last year
- Dataset for the Tensor Trust project ☆40 · Updated last year
- Stanford NLP Python library for benchmarking the utility of LLM interpretability methods ☆79 · Updated last month
- RuLES: a benchmark for evaluating rule-following in language models ☆223 · Updated 2 months ago
- Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs ☆76 · Updated 5 months ago
- Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models … ☆172 · Updated this week
- Code for Preventing Language Models From Hiding Their Reasoning, which evaluates defenses against LLM steganography. ☆19 · Updated last year
- LLM experiments done during SERI MATS, focusing on activation steering / interpreting activation spaces ☆91 · Updated last year
- [ICML 2025] Weak-to-Strong Jailbreaking on Large Language Models ☆74 · Updated 2 weeks ago
- ☆59 · Updated 3 months ago
- Measuring the situational awareness of language models ☆34 · Updated last year
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file. ☆173 · Updated 2 months ago