centerforaisafety / wmdp
WMDP is an LLM proxy benchmark for hazardous knowledge in biosecurity, cybersecurity, and chemical security. We also release code for RMU, an unlearning method that reduces LLM performance on WMDP while retaining general capabilities.
☆134 · Updated 2 months ago
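For orientation, here is a minimal sketch of pulling WMDP questions and formatting one as a multiple-choice prompt. It assumes the benchmark is published on the Hugging Face Hub under the `cais/wmdp` dataset id with subsets such as `wmdp-bio` and `question`/`choices`/`answer` fields; it is not the repository's official evaluation code.

```python
# Minimal sketch (not the official WMDP evaluation harness).
# Assumes the "cais/wmdp" dataset id, a "wmdp-bio" subset, and
# question/choices/answer fields on the Hugging Face Hub.
from datasets import load_dataset

ds = load_dataset("cais/wmdp", "wmdp-bio", split="test")

item = ds[0]
# Build a lettered multiple-choice prompt from the question and its options.
prompt = item["question"] + "\n" + "\n".join(
    f"{letter}. {choice}" for letter, choice in zip("ABCD", item["choices"])
) + "\nAnswer:"

print(prompt)
print("gold:", "ABCD"[item["answer"]])  # "answer" is the index of the correct option
```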
Alternatives and similar repositories for wmdp
Users interested in wmdp are comparing it to the repositories listed below
- Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" ☆106 · Updated 5 months ago
- Improving Alignment and Robustness with Circuit Breakers ☆225 · Updated 10 months ago
- [ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs" ☆59 · Updated 2 months ago
- A Comprehensive Assessment of Trustworthiness in GPT Models ☆299 · Updated 10 months ago
- ICLR 2024 paper showing properties of safety tuning and exaggerated safety ☆85 · Updated last year
- We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20… ☆317 · Updated last year
- Official implementation of AdvPrompter https://arxiv.org/abs/2404.16873 ☆159 · Updated last year
- Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (NeurIPS 2024) ☆65 · Updated 6 months ago
- ☆179 · Updated last year
- [ICML 2024] COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability ☆162 · Updated 7 months ago
- Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024. ☆114 · Updated last year
- Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives ☆70 · Updated last year
- ☆81 · Updated last month
- Official Repository for ACL 2024 Paper SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding ☆141 · Updated last year
- Repo for the research paper "SecAlign: Defending Against Prompt Injection with Preference Optimization" ☆63 · Updated 2 weeks ago
- Official implementation of ICLR'24 paper, "Curiosity-driven Red Teaming for Large Language Models" (https://openreview.net/pdf?id=4KqkizX…) ☆78 · Updated last year
- Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal" (ICLR 2025) ☆58 · Updated 5 months ago
- ☆56 · Updated 2 years ago
- ☆32 · Updated last year
- [ICLR 2025] General-purpose activation steering library ☆88 · Updated 2 weeks ago
- Adversarial Attacks on GPT-4 via Simple Random Search [Dec 2023] ☆43 · Updated last year
- The official implementation of our pre-print paper "Automatic and Universal Prompt Injection Attacks against Large Language Models". ☆52 · Updated 9 months ago
- The official repository of the paper "On the Exploitability of Instruction Tuning". ☆64 · Updated last year
- [ICML 2025] Weak-to-Strong Jailbreaking on Large Language Models ☆84 · Updated 3 months ago
- LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces ☆96 · Updated last year
- A resource repository for representation engineering in large language models ☆129 · Updated 8 months ago
- This is the starter kit for the Trojan Detection Challenge 2023 (LLM Edition), a NeurIPS 2023 competition. ☆90 · Updated last year
- [ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning ☆96 · Updated last year
- [ICML 2024] Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications ☆82 · Updated 4 months ago
- Code and datasets for the paper "Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment" ☆103 · Updated last year