centerforaisafety / wmdp
WMDP is an LLM proxy benchmark for hazardous knowledge in biosecurity, cybersecurity, and chemical security. We also release code for RMU, an unlearning method that reduces LLM performance on WMDP while retaining general capabilities.
☆156 · Updated 6 months ago
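For context, a minimal sketch of loading the benchmark questions with the Hugging Face `datasets` library. The dataset id `cais/wmdp`, subset name `wmdp-bio`, `test` split, and field names (`question`, `choices`, `answer`) are assumptions based on common Hub conventions; verify them against the official repository.

```python
# Minimal sketch: load WMDP multiple-choice questions via Hugging Face `datasets`.
# Assumptions: dataset id "cais/wmdp", subset "wmdp-bio", split "test", and
# fields "question", "choices", "answer" -- check these against the official repo.
from datasets import load_dataset

wmdp_bio = load_dataset("cais/wmdp", "wmdp-bio", split="test")

# Inspect one item to confirm the schema before wiring up an evaluation.
example = wmdp_bio[0]
print(example["question"])
for idx, choice in enumerate(example["choices"]):
    print(f"  ({chr(ord('A') + idx)}) {choice}")
print("gold answer index:", example["answer"])
```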
Alternatives and similar repositories for wmdp
Users that are interested in wmdp are comparing it to the libraries listed below
- Improving Alignment and Robustness with Circuit Breakers ☆248 · Updated last year
- Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" ☆116 · Updated 9 months ago
- [ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs" ☆65 · Updated 6 months ago
- Official implementation of AdvPrompter https://arxiv.org/abs/2404.16873 ☆171 · Updated last year
- ☆189 · Updated 2 years ago
- [ICML 2025] Weak-to-Strong Jailbreaking on Large Language Models ☆90 · Updated 7 months ago
- ☆59 · Updated 2 years ago
- We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20… ☆332 · Updated last year
- A Comprehensive Assessment of Trustworthiness in GPT Models ☆310 · Updated last year
- [ICML 2024] COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability ☆172 · Updated 11 months ago
- Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024. ☆115 · Updated last year
- Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal" (ICLR 2025) ☆70 · Updated 9 months ago
- Official Repository for ACL 2024 Paper SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding ☆152 · Updated last year
- Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives ☆70 · Updated last year
- Repo for the research paper "SecAlign: Defending Against Prompt Injection with Preference Optimization" ☆76 · Updated 4 months ago
- ICLR 2024 paper showing properties of safety tuning and exaggerated safety. ☆89 · Updated last year
- A re-implementation of the "Red Teaming Language Models with Language Models" paper by Perez et al., 2022 ☆35 · Updated 2 years ago
- Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (NeurIPS 2024) ☆65 · Updated 11 months ago
- ☆116 · Updated 5 months ago
- Official implementation of ICLR'24 paper, "Curiosity-driven Red Teaming for Large Language Models" (https://openreview.net/pdf?id=4KqkizX…) ☆84 · Updated last year
- This is the starter kit for the Trojan Detection Challenge 2023 (LLM Edition), a NeurIPS 2023 competition. ☆90 · Updated last year
- ☆46 · Updated last year
- NeurIPS'24 - LLM Safety Landscape ☆33 · Updated last month
- Official Repository for The Paper: Safety Alignment Should Be Made More Than Just a Few Tokens Deep ☆166 · Updated 7 months ago
- The official repository of the paper "On the Exploitability of Instruction Tuning". ☆66 · Updated last year
- A resource repository for representation engineering in large language models ☆143 · Updated last year
- AmpleGCG: Learning a Universal and Transferable Generator of Adversarial Attacks on Both Open and Closed LLMs ☆79 · Updated last year
- Python package for measuring memorization in LLMs. ☆175 · Updated 4 months ago
- Stanford NLP Python library for benchmarking the utility of LLM interpretability methods ☆151 · Updated 5 months ago
- ☆38 · Updated last year