centerforaisafety / wmdp
WMDP is an LLM proxy benchmark for hazardous knowledge in biosecurity, cybersecurity, and chemical security. We also release code for RMU, an unlearning method that reduces LLM performance on WMDP while retaining general capabilities.
☆115 · Updated last year
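To make the RMU description above concrete, here is a minimal sketch of an RMU-style unlearning loss: activations on hazardous ("forget") text are steered toward a fixed random control direction, while activations on benign ("retain") text are kept close to a frozen copy of the original model. This assumes Hugging Face-style models with `output_hidden_states`; the helper `get_hidden`, the layer index, and the hyperparameters are illustrative assumptions, not the repository's exact implementation.

```python
import torch
import torch.nn.functional as F

def get_hidden(model, batch, layer):
    # Hypothetical helper: layer-`layer` hidden states from a Hugging Face
    # causal LM; `batch` is a dict of input_ids / attention_mask tensors.
    out = model(**batch, output_hidden_states=True)
    return out.hidden_states[layer]

def rmu_loss(model, frozen_model, forget_batch, retain_batch,
             control_vec, layer=7, alpha=100.0, c=6.5):
    # Forget term: push activations on hazardous text toward a fixed
    # random control direction, scaled by c.
    h_forget = get_hidden(model, forget_batch, layer)
    forget_loss = F.mse_loss(h_forget, c * control_vec.expand_as(h_forget))

    # Retain term: keep activations on benign text close to the frozen
    # reference model, preserving general capabilities.
    h_retain = get_hidden(model, retain_batch, layer)
    with torch.no_grad():
        h_ref = get_hidden(frozen_model, retain_batch, layer)
    retain_loss = F.mse_loss(h_retain, h_ref)

    return forget_loss + alpha * retain_loss

# The control vector would be sampled once per run, e.g.:
# u = torch.rand(model.config.hidden_size)
# control_vec = u / u.norm()
```

In the actual RMU setup, gradient updates are typically restricted to a few layers' MLP weights; the optimizer and parameter selection are omitted from this sketch.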
Alternatives and similar repositories for wmdp
Users interested in wmdp are comparing it to the repositories listed below.
- Improving Alignment and Robustness with Circuit Breakers ☆201 · Updated 7 months ago
- Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" ☆96 · Updated 2 months ago
- Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024. ☆111 · Updated 11 months ago
- Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (NeurIPS 2024) ☆61 · Updated 4 months ago
- [ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs" ☆55 · Updated 2 months ago
- ☆54 · Updated 2 years ago
- This is the starter kit for the Trojan Detection Challenge 2023 (LLM Edition), a NeurIPS 2023 competition. ☆86 · Updated 11 months ago
- Adversarial Attacks on GPT-4 via Simple Random Search [Dec 2023] ☆43 · Updated last year
- ☆59 · Updated 3 months ago
- Steering vectors for transformer language models in PyTorch / Hugging Face ☆99 · Updated 2 months ago
- [ICML 2025] Weak-to-Strong Jailbreaking on Large Language Models ☆74 · Updated last week
- Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs ☆75 · Updated 5 months ago
- [ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning ☆92 · Updated 11 months ago
- Steering Llama 2 with Contrastive Activation Addition ☆148 · Updated 11 months ago
- LLM experiments done during SERI MATS, focusing on activation steering / interpreting activation spaces ☆91 · Updated last year
- [ICLR 2024] Showing properties of safety tuning and exaggerated safety. ☆82 · Updated last year
- [ICML 2024] Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications ☆76 · Updated last month
- Official Repository for ACL 2024 Paper SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding ☆129 · Updated 9 months ago
- ☆170 · Updated last year
- A resource repository for representation engineering in large language models ☆120 · Updated 5 months ago
- [ICLR 2025] Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates (Oral) ☆78 · Updated 6 months ago
- ☆33 · Updated 4 months ago
- [ICML 2024] COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability ☆152 · Updated 4 months ago
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity. ☆72 · Updated 2 months ago
- Official implementation of AdvPrompter (https://arxiv.org/abs/2404.16873) ☆153 · Updated last year
- Official repository for the paper "Safety Alignment Should Be Made More Than Just a Few Tokens Deep" ☆115 · Updated 3 weeks ago
- ☆31 · Updated last year
- The official repository of the paper "On the Exploitability of Instruction Tuning". ☆62 · Updated last year
- Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction". ☆215 · Updated 7 months ago
- Python package for measuring memorization in LLMs. ☆151 · Updated 5 months ago