centerforaisafety / wmdp
WMDP is an LLM proxy benchmark for hazardous knowledge in bio, cyber, and chemical security. We also release code for RMU, an unlearning method that reduces LLM performance on WMDP while retaining general capabilities.
☆110 · Updated 11 months ago
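The description above names RMU, an unlearning method that steers the model's internal activations away from hazardous knowledge while preserving them on benign data. As a rough illustration of that idea (not the repository's actual code), here is a minimal PyTorch sketch of an RMU-style objective: activations on forget-set text are pushed toward a fixed random control direction, while activations on retain-set text are kept close to those of a frozen copy of the original model. The names `rmu_loss`, `steering_coeff`, and `alpha`, the tensor shapes, and the constant values are illustrative assumptions, not the repo's API or hyperparameters.

```python
# Sketch of an RMU-style unlearning loss (illustrative; not the repository's code).
import torch
import torch.nn.functional as F

def rmu_loss(h_forget, h_retain, h_retain_frozen, control_vec, alpha=100.0):
    """h_*: activations at a chosen layer, shape (batch, seq_len, hidden_dim)."""
    # Push forget-set activations toward the scaled random control direction.
    forget_loss = F.mse_loss(h_forget, control_vec.expand_as(h_forget))
    # Keep retain-set activations close to the frozen (original) model's activations.
    retain_loss = F.mse_loss(h_retain, h_retain_frozen)
    return forget_loss + alpha * retain_loss

# Toy demo with random tensors standing in for a transformer layer's outputs.
hidden_dim = 16
steering_coeff = 6.5  # scale of the control direction (illustrative value)
control_vec = steering_coeff * F.normalize(torch.rand(hidden_dim), dim=-1)

h_forget = torch.randn(2, 8, hidden_dim, requires_grad=True)
h_retain = torch.randn(2, 8, hidden_dim, requires_grad=True)
h_retain_frozen = h_retain.detach() + 0.01 * torch.randn_like(h_retain)

loss = rmu_loss(h_forget, h_retain, h_retain_frozen, control_vec)
loss.backward()  # in practice, gradients update only a few layers of the LLM
print(float(loss))
```

Steering toward a random direction (rather than, say, maximizing loss on the forget set) degrades hazardous-topic representations without an obvious way to recover them, while the retain term anchors general capabilities.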
Alternatives and similar repositories for wmdp:
Users interested in wmdp are comparing it to the repositories listed below
- Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" ☆96 · Updated last month
- Improving Alignment and Robustness with Circuit Breakers ☆196 · Updated 6 months ago
- [ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs" ☆53 · Updated last month
- A resource repository for representation engineering in large language models ☆117 · Updated 5 months ago
- Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (NeurIPS 2024) ☆60 · Updated 3 months ago
- This is the starter kit for the Trojan Detection Challenge 2023 (LLM Edition), a NeurIPS 2023 competition. ☆86 · Updated 11 months ago
- LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces ☆91 · Updated last year
- Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs ☆73 · Updated 4 months ago
- Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives ☆68 · Updated last year
- [ICML 2024] Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications ☆76 · Updated 3 weeks ago
- ☆169 · Updated last year
- Adversarial Attacks on GPT-4 via Simple Random Search [Dec 2023] ☆43 · Updated 11 months ago
- ☆54 · Updated 2 years ago
- Official Repository for the Paper: Safety Alignment Should Be Made More Than Just a Few Tokens Deep ☆89 · Updated 9 months ago
- Official implementation of AdvPrompter https://arxiv.org/abs/2404.16873 ☆151 · Updated 11 months ago
- Official implementation of ICLR'24 paper, "Curiosity-driven Red Teaming for Large Language Models" (https://openreview.net/pdf?id=4KqkizX… ☆74 · Updated last year
- Steering Llama 2 with Contrastive Activation Addition ☆143 · Updated 11 months ago
- ICLR 2024 paper showing properties of safety tuning and exaggerated safety. ☆80 · Updated 11 months ago
- AmpleGCG: Learning a Universal and Transferable Generator of Adversarial Attacks on Both Open and Closed LLMs ☆61 · Updated 5 months ago
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity. ☆71 · Updated last month
- ☆128 · Updated 3 weeks ago
- Official Repository for ACL 2024 Paper SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding ☆129 · Updated 9 months ago
- Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024. ☆111 · Updated 10 months ago
- Weak-to-Strong Jailbreaking on Large Language Models ☆73 · Updated last year
- ☆57 · Updated 9 months ago
- ☆52 · Updated 2 months ago
- Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal" (ICLR 2025) ☆51 · Updated last month
- Steering vectors for transformer language models in Pytorch / Huggingface ☆94 · Updated 2 months ago
- ☆33 · Updated 4 months ago
- A lightweight library for large language model (LLM) jailbreaking defense. ☆51 · Updated 6 months ago