centerforaisafety / wmdp
WMDP is an LLM benchmark that serves as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security. We also release code for RMU, an unlearning method that reduces LLM performance on WMDP while retaining general capabilities.
☆158 · May 29, 2025 · Updated 8 months ago
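As a rough illustration of the idea behind RMU, the sketch below shows one way an RMU-style unlearning loss could be computed in PyTorch. This is a hedged reconstruction, not the official implementation: it assumes Hugging Face-style causal LMs that expose `hidden_states`, and the layer index, control vector, and `alpha` weight are illustrative parameters.

```python
import torch
import torch.nn.functional as F

def rmu_style_loss(model, frozen_model, forget_batch, retain_batch,
                   layer_idx, control_vec, alpha=100.0):
    """Sketch of an RMU-style unlearning objective (illustrative names).

    Pushes the model's activations on hazardous ("forget") text toward a
    fixed random control vector, while penalizing drift of its activations
    on benign ("retain") text away from a frozen reference copy.
    """
    # Hidden states at the chosen layer for the forget batch.
    h_forget = model(**forget_batch,
                     output_hidden_states=True).hidden_states[layer_idx]
    # Steer forget activations toward the control vector.
    forget_loss = F.mse_loss(h_forget, control_vec.expand_as(h_forget))

    # Keep retain activations close to the frozen model's.
    h_retain = model(**retain_batch,
                     output_hidden_states=True).hidden_states[layer_idx]
    with torch.no_grad():
        h_ref = frozen_model(**retain_batch,
                             output_hidden_states=True).hidden_states[layer_idx]
    retain_loss = F.mse_loss(h_retain, h_ref)

    return forget_loss + alpha * retain_loss

# A typical control vector: a fixed random direction scaled by a constant
# (the scale here is an assumption, not a tuned value).
# control_vec = 20.0 * F.normalize(torch.rand(model.config.hidden_size), dim=0)
```

The released code carries additional details (e.g., which layers' parameters are actually trained); this sketch only conveys the shape of the objective.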
Alternatives and similar repositories for wmdp
Users interested in wmdp are comparing it to the repositories listed below.
- ☆27 · Oct 6, 2024 · Updated last year
- Official repo for EMNLP'24 paper "SOUL: Unlocking the Power of Second-Order Optimization for LLM Unlearning" ☆29 · Oct 1, 2024 · Updated last year
- [ICLR 2025] A Closer Look at Machine Unlearning for Large Language Models ☆44 · Dec 4, 2024 · Updated last year
- [NeurIPS D&B '25] The one-stop repository for LLM unlearning ☆479 · Dec 24, 2025 · Updated last month
- Improving Alignment and Robustness with Circuit Breakers ☆258 · Sep 24, 2024 · Updated last year
- RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models. NeurIPS 2024 ☆90 · Sep 30, 2024 · Updated last year
- Official repo for NeurIPS'24 paper "WAGLE: Strategic Weight Attribution for Effective and Modular Unlearning in Large Language Models" ☆18 · Dec 16, 2024 · Updated last year
- [ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs" ☆66 · Jun 9, 2025 · Updated 8 months ago
- ☆44 · Oct 1, 2024 · Updated last year
- ☆73 · Jul 15, 2024 · Updated last year
- LLM Unlearning ☆181 · Oct 20, 2023 · Updated 2 years ago
- Official code for the paper "Vaccine: Perturbation-aware Alignment for Large Language Models" (NeurIPS 2024) ☆49 · Jan 15, 2026 · Updated last month
- ☆35 · May 9, 2025 · Updated 9 months ago
- A resource repository for machine unlearning in large language models ☆534 · Jan 6, 2026 · Updated last month
- ☆33 · Mar 13, 2025 · Updated 11 months ago
- ☆19 · Jun 21, 2025 · Updated 7 months ago
- HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal ☆854 · Aug 16, 2024 · Updated last year
- ☆44 · Mar 3, 2023 · Updated 2 years ago
- We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20… ☆338 · Feb 23, 2024 · Updated last year
- ☆185 · Nov 17, 2025 · Updated 2 months ago
- [NeurIPS 2024 D&B] Evaluating Copyright Takedown Methods for Language Models ☆17 · Jul 17, 2024 · Updated last year
- A collection of different ways to access and modify internal model activations in LLMs ☆20 · Oct 18, 2024 · Updated last year
- ☆12 · Oct 23, 2022 · Updated 3 years ago
- ☆20 · Nov 15, 2024 · Updated last year
- ☆21 · Jun 22, 2025 · Updated 7 months ago
- ☆34 · Feb 11, 2025 · Updated last year
- Steering Llama 2 with Contrastive Activation Addition ☆209 · May 23, 2024 · Updated last year
- Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" ☆127 · Feb 24, 2025 · Updated 11 months ago
- Improving Steering Vectors by Targeting Sparse Autoencoder Features ☆27 · Nov 20, 2024 · Updated last year
- Code repo for the "model organisms" and "convergent directions" EM papers ☆49 · Sep 22, 2025 · Updated 4 months ago
- ☆146 · Jul 23, 2025 · Updated 6 months ago
- Butler is a tool project for automating service management and task scheduling. ☆15 · Updated this week
- ☆47 · Sep 29, 2024 · Updated last year
- [NeurIPS'23] Aging with GRACE: Lifelong Model Editing with Discrete Key-Value Adaptors ☆83 · Dec 21, 2024 · Updated last year
- [EMNLP 2024] "Revisiting Who's Harry Potter: Towards Targeted Unlearning from a Causal Intervention Perspective" ☆32 · Jul 22, 2024 · Updated last year
- Stanford NLP Python library for benchmarking the utility of LLM interpretability methods ☆165 · Jun 25, 2025 · Updated 7 months ago
- Code repository for "Uncovering Safety Risks of Large Language Models through Concept Activation Vector" ☆47 · Oct 13, 2025 · Updated 4 months ago
- Evaluate interpretability methods on localizing and disentangling concepts in LLMs. ☆57 · Oct 30, 2025 · Updated 3 months ago
- Official code for the ICML 2024 paper "Connecting the Dots: Collaborative Fine-tuning for Black-Box Vision-Language Models" ☆19 · Jun 12, 2024 · Updated last year