WMDP is an LLM proxy benchmark for hazardous knowledge in biosecurity, cybersecurity, and chemical security. We also release code for RMU, an unlearning method that reduces LLM performance on WMDP while retaining general capabilities.
☆160 · May 29, 2025 · Updated 9 months ago
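The RMU objective mentioned above can be sketched in a few lines. This is a minimal, hypothetical simplification: real RMU steers transformer hidden states at selected layers and updates only a few MLP weight matrices, while here the activations are plain Python vectors and the model itself is omitted. The idea is that a forget term pushes activations on forget-set data toward a scaled random control vector, while a retain term (weighted by `alpha`) keeps activations on retain-set data close to the frozen model's.

```python
import random

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def rmu_loss(h_forget, h_retain, h_retain_frozen, control_vec, alpha=100.0):
    """Illustrative RMU-style loss (names here are our own, not the repo's API).

    forget term: drive forget-set activations toward a random control vector,
                 destroying the representations that encode hazardous knowledge.
    retain term: keep retain-set activations close to the frozen model's,
                 preserving general capabilities. alpha weights this penalty.
    """
    forget = mse(h_forget, control_vec)
    retain = mse(h_retain, h_retain_frozen)
    return forget + alpha * retain

random.seed(0)
dim = 8
c = 6.5  # steering coefficient scaling the random control vector
control = [c * random.uniform(0, 1) for _ in range(dim)]
h_f = [random.gauss(0, 1) for _ in range(dim)]   # updated model, forget data
h_r = [random.gauss(0, 1) for _ in range(dim)]   # updated model, retain data
h_r0 = list(h_r)                                  # frozen model, retain data
print(rmu_loss(h_f, h_r, h_r0, control))  # retain term is zero here
```

In practice the forget term never reaches zero on a real model; the large `alpha` simply biases the gradient toward leaving retain-set behavior untouched while the forget-set representations are scrambled.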
Alternatives and similar repositories for wmdp
Users interested in wmdp are comparing it to the repositories listed below.
- ☆26 · Oct 6, 2024 · Updated last year
- Official repo for EMNLP'24 paper "SOUL: Unlocking the Power of Second-Order Optimization for LLM Unlearning" ☆29 · Oct 1, 2024 · Updated last year
- [ICLR 2025] A Closer Look at Machine Unlearning for Large Language Models ☆45 · Dec 4, 2024 · Updated last year
- [NeurIPS D&B '25] The one-stop repository for LLM unlearning ☆497 · Updated this week
- Improving Alignment and Robustness with Circuit Breakers ☆258 · Sep 24, 2024 · Updated last year
- ☆32 · Aug 9, 2024 · Updated last year
- RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models. NeurIPS 2024 ☆91 · Sep 30, 2024 · Updated last year
- Official repo for NeurIPS'24 paper "WAGLE: Strategic Weight Attribution for Effective and Modular Unlearning in Large Language Models" ☆18 · Dec 16, 2024 · Updated last year
- [ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs" ☆67 · Jun 9, 2025 · Updated 8 months ago
- [NeurIPS'25] Official repo for "Simplicity Prevails: Rethinking Negative Preference Optimization for LLM Unlearning" ☆42 · Oct 3, 2025 · Updated 5 months ago
- ☆44 · Oct 1, 2024 · Updated last year
- ☆73 · Jul 15, 2024 · Updated last year
- LLM Unlearning ☆182 · Oct 20, 2023 · Updated 2 years ago
- Official code for the paper "Vaccine: Perturbation-aware Alignment for Large Language Models" (NeurIPS 2024) ☆49 · Jan 15, 2026 · Updated last month
- ☆36 · May 9, 2025 · Updated 10 months ago
- A resource repository for machine unlearning in large language models ☆542 · Feb 24, 2026 · Updated last week
- ☆32 · Mar 13, 2025 · Updated 11 months ago
- ☆19 · Jun 21, 2025 · Updated 8 months ago
- Official Implementation of "Learning to Refuse: Towards Mitigating Privacy Risks in LLMs" ☆10 · Dec 13, 2024 · Updated last year
- HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal ☆875 · Aug 16, 2024 · Updated last year
- [ICLR 2025] FLAT: LLM Unlearning via Loss Adjustment with Only Forget Data ☆14 · Feb 26, 2025 · Updated last year
- ☆44 · Mar 3, 2023 · Updated 3 years ago
- We jailbreak GPT-3.5 Turbo's safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20… ☆343 · Feb 23, 2024 · Updated 2 years ago
- ☆185 · Nov 17, 2025 · Updated 3 months ago
- [NeurIPS 2024 D&B] Evaluating Copyright Takedown Methods for Language Models ☆17 · Jul 17, 2024 · Updated last year
- A collection of different ways to implement accessing and modifying internal model activations for LLMs ☆20 · Oct 18, 2024 · Updated last year
- ☆20 · Nov 15, 2024 · Updated last year
- Constructing a community of LLM-based agents in Minecraft ☆17 · Nov 27, 2025 · Updated 3 months ago
- ☆12 · Oct 23, 2022 · Updated 3 years ago
- ☆21 · Jun 22, 2025 · Updated 8 months ago
- Steering Llama 2 with Contrastive Activation Addition ☆213 · May 23, 2024 · Updated last year
- Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" ☆129 · Feb 24, 2025 · Updated last year
- Improving Steering Vectors by Targeting Sparse Autoencoder Features ☆27 · Nov 20, 2024 · Updated last year
- ☆147 · Jul 23, 2025 · Updated 7 months ago
- ☆48 · Sep 29, 2024 · Updated last year
- [NeurIPS'23] Aging with GRACE: Lifelong Model Editing with Discrete Key-Value Adaptors ☆83 · Dec 21, 2024 · Updated last year
- Code repo for the model organisms and convergent directions of EM papers. ☆53 · Sep 22, 2025 · Updated 5 months ago
- Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal" (ICLR 2025) ☆76 · Mar 1, 2025 · Updated last year
- Butler is a tool project for automated service management and task scheduling. ☆16 · Mar 2, 2026 · Updated last week