centerforaisafety / wmdp
WMDP is an LLM proxy benchmark for hazardous knowledge in biosecurity, cybersecurity, and chemical security. We also release code for RMU, an unlearning method that reduces LLM performance on WMDP while retaining general capabilities (a minimal sketch of the RMU idea follows below).
☆99 · Updated 9 months ago
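For context, here is a minimal sketch of the RMU idea in PyTorch: steer hidden activations on hazardous "forget" text toward a fixed random control vector, while pinning activations on benign "retain" text to a frozen copy of the model. The model name, layer index, and hyperparameters below are illustrative placeholders, not the repository's actual configuration.

```python
# Illustrative RMU-style unlearning sketch (not the official implementation).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM

model_name = "gpt2"  # placeholder; the paper uses larger open-weight models
updated = AutoModelForCausalLM.from_pretrained(model_name)
frozen = AutoModelForCausalLM.from_pretrained(model_name).eval()
for p in frozen.parameters():
    p.requires_grad_(False)

layer = 7           # layer whose activations are steered (illustrative)
steer_coeff = 20.0  # scale of the random control vector (illustrative)
alpha = 100.0       # weight on the retain loss (illustrative)

# Fixed random direction that forget-set activations are pushed toward.
control = torch.rand(updated.config.hidden_size)
control = steer_coeff * control / control.norm()

def acts(model, input_ids, layer):
    """Hidden states at `layer` for a batch of token ids."""
    return model(input_ids, output_hidden_states=True).hidden_states[layer]

optimizer = torch.optim.AdamW(updated.parameters(), lr=5e-5)

def rmu_step(forget_ids, retain_ids):
    # Forget loss: push activations on hazardous text toward the control vector.
    h_forget = acts(updated, forget_ids, layer)
    forget_loss = F.mse_loss(h_forget, control.expand_as(h_forget))
    # Retain loss: keep activations on benign text close to the frozen model's.
    h_retain = acts(updated, retain_ids, layer)
    with torch.no_grad():
        h_retain_ref = acts(frozen, retain_ids, layer)
    retain_loss = F.mse_loss(h_retain, h_retain_ref)
    loss = forget_loss + alpha * retain_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that the actual method trains only a few MLP weight matrices near the steered layer rather than all parameters; consult the repository for the exact recipe.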
Alternatives and similar repositories for wmdp:
Users interested in wmdp are comparing it to the libraries listed below.
- ☆52 · Updated last year
- Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" · ☆82 · Updated this week
- Improving Alignment and Robustness with Circuit Breakers · ☆181 · Updated 4 months ago
- ☆31 · Updated 4 months ago
- A resource repository for representation engineering in large language models · ☆99 · Updated 3 months ago
- Landing page for TOFU · ☆113 · Updated this week
- Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (NeurIPS 2024) · ☆54 · Updated last month
- Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives · ☆66 · Updated 11 months ago
- Adversarial Attacks on GPT-4 via Simple Random Search [Dec 2023] · ☆43 · Updated 9 months ago
- ☆17 · Updated 4 months ago
- Official implementation of the ICLR'24 paper "Curiosity-driven Red Teaming for Large Language Models" (https://openreview.net/pdf?id=4KqkizX…) · ☆69 · Updated 11 months ago
- ICLR 2024 paper showing properties of safety tuning and exaggerated safety · ☆77 · Updated 9 months ago
- LLM experiments done during SERI MATS, focusing on activation steering / interpreting activation spaces · ☆87 · Updated last year
- The official repository of the paper "On the Exploitability of Instruction Tuning" · ☆58 · Updated last year
- Official repository for the paper "Safety Alignment Should Be Made More Than Just a Few Tokens Deep" · ☆70 · Updated 7 months ago
- [ICLR 2025] Official repository for "Tamper-Resistant Safeguards for Open-Weight LLMs" · ☆46 · Updated 3 weeks ago
- General-purpose activation steering library · ☆43 · Updated last month
- Starter kit for the Trojan Detection Challenge 2023 (LLM Edition), a NeurIPS 2023 competition · ☆84 · Updated 8 months ago
- Repo for the research paper "SecAlign: Defending Against Prompt Injection with Preference Optimization" · ☆35 · Updated 3 weeks ago
- ☆31 · Updated last year
- Official implementation of our preprint "Automatic and Universal Prompt Injection Attacks against Large Language Models" · ☆39 · Updated 3 months ago
- [ICML 2024] Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications · ☆68 · Updated 4 months ago
- Steering Llama 2 with Contrastive Activation Addition · ☆122 · Updated 8 months ago
- ☆30 · Updated last month
- ☆41 · Updated last week
- [ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning · ☆88 · Updated 8 months ago
- Official implementation of AdvPrompter (https://arxiv.org/abs/2404.16873) · ☆139 · Updated 9 months ago
- Official repository for the ACL 2024 paper "SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding" · ☆116 · Updated 6 months ago
- Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024 · ☆110 · Updated 8 months ago
- ☆36 · Updated last year