anthropics / sleeper-agents-paper

Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training".

☆111

Alternatives and similar repositories for sleeper-agents-paper

Users who are interested in sleeper-agents-paper are comparing it to the repositories listed below.

google-deepmind / dangerous-capability-evaluations
☆55 · Updated this week
GraySwanAI / circuit-breakers
Improving Alignment and Robustness with Circuit Breakers
☆225 · Updated 10 months ago
centerforaisafety / wmdp
WMDP is an LLM proxy benchmark for hazardous knowledge in bio, cyber, and chemical security. We also release code for RMU, an unlearning m…
☆132 · Updated 2 months ago
ucl-dark / llm_debate
Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers"
☆113 · Updated last year
ethz-spylab / rlhf_trojan_competition
Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024.
☆114 · Updated last year
METR / RE-Bench
☆95 · Updated 3 months ago
EleutherAI / elk-generalization
Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e…
☆28 · Updated last year
steering-vectors / steering-vectors
Steering vectors for transformer language models in PyTorch / Hugging Face
☆119 · Updated 5 months ago
ryoungj / ToolEmu
[ICLR'24 Spotlight] A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use
☆152 · Updated last year
aypan17 / machiavelli
☆137 · Updated 2 weeks ago
stanfordnlp / axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
☆112 · Updated last month
andyrdt / refusal_direction
Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".
☆249 · Updated last month
nrimsky / LM-exp
LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces
☆96 · Updated last year
redwoodresearch / alignment_faking_public
☆70 · Updated 2 months ago
max-andr / adversarial-random-search-gpt4
Adversarial Attacks on GPT-4 via Simple Random Search [Dec 2023]
☆43 · Updated last year
LoryPack / LLM-LieDetector
Code for the ICLR 2024 paper "How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions"
☆72 · Updated last year
JonasGeiping / carving
Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives
☆70 · Updated last year
goodfire-ai / r1-interpretability
Open source interpretability artefacts for R1.
☆157 · Updated 3 months ago
UKGovernmentBEIS / inspect_evals
Collection of evals for Inspect AI
☆198 · Updated this week
meg-tong / sycophancy-eval
Datasets from the paper "Towards Understanding Sycophancy in Language Models"
☆86 · Updated last year
redwoodresearch / Text-Steganography-Benchmark
Code for "Preventing Language Models From Hiding Their Reasoning", which evaluates defenses against LLM steganography.
☆22 · Updated last year
EleutherAI / delphi
Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models …
☆200 · Updated this week
anthropics / evals
☆287 · Updated last year
princeton-pli / hal-harness
☆102 · Updated this week
ApolloResearch / e2e_sae
Sparse Autoencoder Training Library
☆54 · Updated 3 months ago
thestephencasper / explore_establish_exploit_llms
☆31 · Updated 2 years ago
TransluceAI / observatory
A toolkit for describing model features and intervening on those features to steer behavior.
☆195 · Updated 8 months ago
METR / task-standard
METR Task Standard
☆156 · Updated 6 months ago
andyzoujm / breaking-llama-guard
Code to break Llama Guard
☆31 · Updated last year
ConsequentAI / fneval
Functional Benchmarks and the Reasoning Gap
☆88 · Updated 10 months ago