centerforaisafety / wmdp
WMDP is an LLM proxy benchmark for hazardous knowledge in biosecurity, cybersecurity, and chemical security. We also release code for RMU, an unlearning method that reduces LLM performance on WMDP while retaining general capabilities.
☆127 · Updated 3 weeks ago
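The RMU method mentioned above lends itself to a short illustration. Below is a minimal, hedged sketch of an RMU-style unlearning step in PyTorch, assuming the high-level recipe in the description: push the model's activations on "forget" inputs toward a fixed random control vector while keeping its activations on "retain" inputs close to a frozen copy. The two-layer MLP, the control-vector scale, and the retain weight are illustrative stand-ins, not values from the repo.

```python
# Toy RMU-style unlearning step. Everything here (model, data, the scale 6.5,
# the retain weight 100) is an illustrative stand-in, not the wmdp repo's code.
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)
d = 64
updated = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
frozen = copy.deepcopy(updated).eval()
for p in frozen.parameters():
    p.requires_grad_(False)

# Fixed random "control" direction that forget-set activations are pushed toward.
u = torch.randn(d)
control = 6.5 * u / u.norm()

opt = torch.optim.AdamW(updated.parameters(), lr=1e-3)
forget_x = torch.randn(32, d)  # stand-in hidden states for hazardous text
retain_x = torch.randn(32, d)  # stand-in hidden states for benign text

for step in range(100):
    opt.zero_grad()
    # Forget loss: steer activations on forget data onto the control vector.
    l_forget = ((updated(forget_x) - control) ** 2).mean()
    # Retain loss: keep activations on retain data close to the frozen model's.
    l_retain = ((updated(retain_x) - frozen(retain_x)) ** 2).mean()
    (l_forget + 100.0 * l_retain).backward()
    opt.step()
```

In the full method the loss is applied to hidden states at a chosen intermediate layer of the LLM and only a subset of weights is updated; the toy MLP above only shows the shape of the objective.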
Alternatives and similar repositories for wmdp
Users interested in wmdp are comparing it to the repositories listed below.
- Improving Alignment and Robustness with Circuit Breakers ☆214 · Updated 9 months ago
- Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" ☆98 · Updated 4 months ago
- Official repository for the ACL 2024 paper "SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding" ☆134 · Updated 11 months ago
- Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (NeurIPS 2024) ☆61 · Updated 5 months ago
- Starter kit for the Trojan Detection Challenge 2023 (LLM Edition), a NeurIPS 2023 competition ☆90 · Updated last year
- ☆54 · Updated 2 years ago
- ICLR 2024 paper showing properties of safety tuning and exaggerated safety ☆85 · Updated last year
- Official implementation of AdvPrompter (https://arxiv.org/abs/2404.16873) ☆157 · Updated last year
- Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal" (ICLR 2025) ☆55 · Updated 3 months ago
- ☆68 · Updated last month
- Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs ☆82 · Updated 6 months ago
- [ICLR 2025] Official repository for "Tamper-Resistant Safeguards for Open-Weight LLMs" ☆58 · Updated 2 weeks ago
- [NDSS'25 Best Technical Poster] A collection of automated evaluators for assessing jailbreak attempts ☆158 · Updated 2 months ago
- Papers about red-teaming LLMs and multimodal models ☆123 · Updated 3 weeks ago
- LLM experiments done during SERI MATS, focusing on activation steering and interpreting activation spaces ☆94 · Updated last year
- A lightweight library for large language model (LLM) jailbreaking defense ☆51 · Updated 8 months ago
- A resource repository for representation engineering in large language models ☆126 · Updated 7 months ago
- Steering Llama 2 with Contrastive Activation Addition (see the steering sketch after this list) ☆158 · Updated last year
- Official implementation of the pre-print "Automatic and Universal Prompt Injection Attacks against Large Language Models" ☆49 · Updated 8 months ago
- ☆39 · Updated 7 months ago
- [ICML 2024] COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability ☆152 · Updated 6 months ago
- Official implementation of the ICLR 2024 paper "Curiosity-driven Red Teaming for Large Language Models" (https://openreview.net/pdf?id=4KqkizX…) ☆76 · Updated last year
- Finding trojans in aligned LLMs; official repository for the competition hosted at SaTML 2024 ☆113 · Updated last year
- A fast, lightweight implementation of the GCG algorithm in PyTorch (see the GCG sketch after this list) ☆245 · Updated last month
- Steering vectors for transformer language models in PyTorch / Hugging Face ☆108 · Updated 4 months ago
- AmpleGCG: Learning a Universal and Transferable Generator of Adversarial Attacks on Both Open and Closed LLMs ☆65 · Updated 7 months ago
- General-purpose activation steering library ☆78 · Updated last month
- Code and datasets for the paper "Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment" ☆100 · Updated last year
- [ICML 2025] Weak-to-Strong Jailbreaking on Large Language Models ☆76 · Updated last month
- Python package for measuring memorization in LLMs ☆159 · Updated 7 months ago
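Several items above (Contrastive Activation Addition, the steering-vector and activation-steering libraries) share one core idea: take the mean difference between hidden activations on contrastive prompt pairs, then add the scaled vector back at inference. Here is a minimal, hedged sketch of that recipe; the tiny linear "layer", the contrastive batches, and `scale` are illustrative stand-ins, not any listed repo's API.

```python
# Toy contrastive-activation-addition sketch; not any listed repo's actual API.
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 32
layer = nn.Linear(d, d)  # stand-in for one transformer layer

# Stand-ins for hidden states of paired prompts that do vs. don't show a behavior.
with torch.no_grad():
    pos_acts = layer(torch.randn(8, d))
    neg_acts = layer(torch.randn(8, d))

# Steering vector: mean activation difference across the contrastive pairs.
steer = (pos_acts - neg_acts).mean(dim=0)

def steered_forward(x: torch.Tensor, scale: float = 2.0) -> torch.Tensor:
    """Run the layer, then shift its output along the steering direction."""
    return layer(x) + scale * steer

print(steered_forward(torch.randn(1, d)).shape)  # torch.Size([1, 32])
```

In a real model this shift is usually applied with a forward hook on one chosen decoder layer; the sign and magnitude of `scale` control whether the behavior is amplified or suppressed.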
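The GCG implementation listed above also reduces to a compact loop. This is a toy, single-step sketch under the usual description of greedy coordinate gradient: differentiate the loss with respect to one-hot token inputs, shortlist the top-k substitutions per position, then evaluate swaps and keep the best. The random embedding table and linear scorer stand in for a real LM; nothing here reflects the repo's API.

```python
# Toy single-step GCG-style token search; all components are stand-ins.
import torch

torch.manual_seed(0)
vocab, d, seq = 100, 16, 5
emb = torch.randn(vocab, d)           # stand-in embedding table
scorer = torch.nn.Linear(seq * d, 1)  # stand-in for the LM's loss head

def loss_fn(one_hot: torch.Tensor) -> torch.Tensor:
    x = one_hot @ emb                 # (seq, d) embedded suffix
    return scorer(x.flatten()).squeeze()

suffix = torch.randint(0, vocab, (seq,))  # current adversarial suffix tokens
one_hot = torch.nn.functional.one_hot(suffix, vocab).float().requires_grad_(True)
loss = loss_fn(one_hot)
loss.backward()

# Shortlist: per position, the k tokens whose substitution most decreases the
# (linearized) loss, i.e. the most negative gradient entries.
k = 8
candidates = (-one_hot.grad).topk(k, dim=1).indices  # (seq, k)

# Evaluate single-token swaps from the shortlist and keep the best one.
best_loss, best_suffix = loss.item(), suffix.clone()
for pos in range(seq):
    for tok in candidates[pos]:
        trial = suffix.clone()
        trial[pos] = tok
        with torch.no_grad():
            l = loss_fn(torch.nn.functional.one_hot(trial, vocab).float()).item()
        if l < best_loss:
            best_loss, best_suffix = l, trial
suffix = best_suffix
```

The full algorithm repeats this step many times and samples a random batch of candidate swaps rather than scanning the whole shortlist; the sketch keeps only the gradient-shortlist-then-evaluate core.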