centerforaisafety / wmdp
WMDP is an LLM proxy benchmark for hazardous knowledge in biosecurity, cybersecurity, and chemical security. We also release code for RMU, an unlearning method that reduces LLM performance on WMDP while retaining general capabilities.
☆95 · Updated 9 months ago
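For context, a minimal sketch of how the benchmark might be consumed: zero-shot multiple-choice scoring over the questions. This assumes the dataset is published as `cais/wmdp` on the Hugging Face Hub with config names like `wmdp-bio` and `question`, `choices`, and integer `answer` fields; the model name is illustrative, and none of these details are taken from this repo:

```python
# Hedged sketch: zero-shot multiple-choice scoring on WMDP.
# Dataset id, config names, and field names are assumptions -- check the repo/Hub.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # illustrative; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

ds = load_dataset("cais/wmdp", "wmdp-bio", split="test")

correct = 0
for ex in ds:
    choices = "\n".join(f"{l}. {c}" for l, c in zip("ABCD", ex["choices"]))
    prompt = f"{ex['question']}\n{choices}\nAnswer:"
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    # Rank the next-token logits of the four answer letters.
    letter_ids = [tok(f" {l}", add_special_tokens=False).input_ids[-1] for l in "ABCD"]
    correct += int(logits[letter_ids].argmax().item() == ex["answer"])

print(f"wmdp-bio accuracy: {correct / len(ds):.3f}")
```

And a hedged paraphrase of the RMU objective the description mentions: steer the model's activations on hazardous (forget) text toward a fixed random control vector, while keeping activations on benign (retain) text close to those of a frozen copy of the model. The layer index, `alpha`, and helper names below are illustrative assumptions, not the repository's actual code:

```python
# Hedged paraphrase of the RMU loss described in the WMDP paper (not the repo's code).
import torch
import torch.nn.functional as F

def layer_acts(model, batch, layer):
    """Hidden states of one transformer layer for a tokenized batch."""
    return model(**batch, output_hidden_states=True).hidden_states[layer]

def rmu_loss(model, frozen_model, forget_batch, retain_batch,
             control_vec, layer=7, alpha=100.0):
    # Forget term: push activations on hazardous text toward a fixed random
    # control vector (roughly c * u, with u a random unit vector).
    h_forget = layer_acts(model, forget_batch, layer)
    forget_loss = F.mse_loss(h_forget, control_vec.expand_as(h_forget))
    # Retain term: keep activations on benign text close to the frozen model's.
    with torch.no_grad():
        h_ref = layer_acts(frozen_model, retain_batch, layer)
    retain_loss = F.mse_loss(layer_acts(model, retain_batch, layer), h_ref)
    return forget_loss + alpha * retain_loss
```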
Alternatives and similar repositories for wmdp:
Users interested in wmdp are comparing it to the repositories listed below.
- Röttger et al. (2023): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models"☆79 · Updated last year
- ☆52 · Updated last year
- Improving Alignment and Robustness with Circuit Breakers☆176 · Updated 4 months ago
- ☆31 · Updated 4 months ago
- [ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs"☆44 · Updated last week
- ICLR 2024 paper showing properties of safety tuning and exaggerated safety.☆76 · Updated 8 months ago
- The official implementation of our preprint "Automatic and Universal Prompt Injection Attacks against Large Language Models".☆39 · Updated 3 months ago
- Landing Page for TOFU☆108 · Updated last month
- Official implementation of ICLR'24 paper, "Curiosity-driven Red Teaming for Large Language Models" (https://openreview.net/pdf?id=4KqkizX…)☆67 · Updated 10 months ago
- A resource repository for representation engineering in large language models☆98 · Updated 2 months ago
- Official Repository for The Paper: Safety Alignment Should Be Made More Than Just a Few Tokens Deep☆67 · Updated 6 months ago
- This is the starter kit for the Trojan Detection Challenge 2023 (LLM Edition), a NeurIPS 2023 competition.☆82 · Updated 8 months ago
- Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (NeurIPS 2024)☆53 · Updated 2 weeks ago
- LLM experiments done during SERI MATS, focusing on activation steering / interpreting activation spaces☆84 · Updated last year
- [ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning☆88 · Updated 8 months ago
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity.☆62 · Updated 2 months ago
- Official implementation of AdvPrompter (https://arxiv.org/abs/2404.16873)☆134 · Updated 8 months ago
- Adversarial Attacks on GPT-4 via Simple Random Search [Dec 2023]☆43 · Updated 9 months ago
- ☆31 · Updated last year
- Official Code for Paper: Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications☆68 · Updated 3 months ago
- Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives☆66 · Updated 11 months ago
- ☆40 · Updated last year
- ☆17 · Updated 3 months ago
- Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024.☆110 · Updated 7 months ago
- Repo for the research paper "SecAlign: Defending Against Prompt Injection with Preference Optimization"☆33 · Updated last week
- ☆14 · Updated 7 months ago
- Steering Llama 2 with Contrastive Activation Addition☆119 · Updated 8 months ago
- ☆30 · Updated last month
- The official repository of the paper "On the Exploitability of Instruction Tuning".☆58 · Updated 11 months ago
- Independent robustness evaluation of Improving Alignment and Robustness with Short Circuiting☆13 · Updated 5 months ago