centerforaisafety / wmdp
WMDP is an LLM proxy benchmark for hazardous knowledge in biosecurity, cybersecurity, and chemical security. We also release code for RMU, an unlearning method that reduces LLM performance on WMDP while retaining general capabilities.
☆121 · Updated last week
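WMDP is a multiple-choice benchmark, so a model can be scored with a standard likelihood-based evaluation. The sketch below is a minimal illustration, not the official harness: it assumes the dataset is published on Hugging Face as `cais/wmdp` with `wmdp-bio` / `wmdp-cyber` / `wmdp-chem` configs and `question`, `choices`, and `answer` fields, and the model name is a placeholder; check the repository and dataset card for the exact schema and the official evaluation code.

```python
# Minimal sketch (not the official harness): zero-shot multiple-choice
# accuracy on a WMDP split. Dataset path, config names, field names, and
# the model are assumptions -- verify against the repo's dataset card.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

ds = load_dataset("cais/wmdp", "wmdp-bio", split="test")  # assumed HF path/config

letters = ["A", "B", "C", "D"]
# Token ids for " A", " B", ... so they match the text that follows "Answer:".
letter_ids = [tokenizer.encode(" " + l, add_special_tokens=False)[0] for l in letters]

correct = 0
for ex in ds:
    # Format the question as a lettered multiple-choice prompt.
    prompt = (
        ex["question"].strip()
        + "\n"
        + "\n".join(f"{l}. {c}" for l, c in zip(letters, ex["choices"]))
        + "\nAnswer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    pred = int(next_token_logits[letter_ids].argmax())  # highest-scoring letter
    correct += int(pred == ex["answer"])              # assumes integer gold index

print(f"accuracy: {correct / len(ds):.3f}")
```

Under this kind of evaluation, the goal of RMU-style unlearning is to push accuracy on the WMDP splits toward random chance while leaving scores on general benchmarks such as MMLU largely unchanged.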
Alternatives and similar repositories for wmdp
Users interested in wmdp are comparing it to the repositories listed below.
- Improving Alignment and Robustness with Circuit Breakers ☆208 · Updated 8 months ago
- Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" ☆97 · Updated 3 months ago
- ICLR 2024 paper showing properties of safety tuning and exaggerated safety. ☆84 · Updated last year
- Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (NeurIPS 2024) ☆61 · Updated 4 months ago
- LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces ☆93 · Updated last year
- Official implementation of AdvPrompter https://arxiv.org/abs/2404.16873 ☆155 · Updated last year
- [ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning ☆93 · Updated last year
- ☆64 · Updated 3 weeks ago
- [NDSS'25 Best Technical Poster] A collection of automated evaluators for assessing jailbreak attempts. ☆157 · Updated 2 months ago
- ☆54 · Updated 2 years ago
- The official repository of the paper "On the Exploitability of Instruction Tuning". ☆63 · Updated last year
- [ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs" ☆56 · Updated 3 months ago
- Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs ☆77 · Updated 6 months ago
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity. ☆72 · Updated 2 months ago
- A lightweight library for large language model (LLM) jailbreaking defense. ☆51 · Updated 7 months ago
- Official implementation of ICLR'24 paper, "Curiosity-driven Red Teaming for Large Language Models" (https://openreview.net/pdf?id=4KqkizX… ☆75 · Updated last year
- Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal" (ICLR 2025) ☆54 · Updated 3 months ago
- Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024. ☆113 · Updated 11 months ago
- Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives ☆69 · Updated last year
- This is the starter kit for the Trojan Detection Challenge 2023 (LLM Edition), a NeurIPS 2023 competition. ☆89 · Updated last year
- A resource repository for representation engineering in large language models ☆124 · Updated 6 months ago
- Steering Llama 2 with Contrastive Activation Addition ☆155 · Updated last year
- General-purpose activation steering library ☆75 · Updated 3 weeks ago
- AmpleGCG: Learning a Universal and Transferable Generator of Adversarial Attacks on Both Open and Closed LLM ☆64 · Updated 7 months ago
- Official Repository for ACL 2024 Paper SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding ☆133 · Updated 10 months ago
- Steering vectors for transformer language models in Pytorch / Huggingface ☆103 · Updated 3 months ago
- [ICLR 2025] Dissecting adversarial robustness of multimodal language model agents ☆88 · Updated 3 months ago
- ☆173 · Updated last year
- LLM Unlearning ☆162 · Updated last year
- The official implementation of our pre-print paper "Automatic and Universal Prompt Injection Attacks against Large Language Models". ☆48 · Updated 7 months ago