centerforaisafety / wmdp
WMDP is an LLM proxy benchmark for hazardous knowledge in biosecurity, cybersecurity, and chemical security. We also release code for RMU, an unlearning method that reduces LLM performance on WMDP while retaining general capabilities.
☆72 · Updated 4 months ago
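For orientation, here is a minimal sketch of the RMU objective as described in the WMDP paper: activations on forget-set data are steered toward a scaled random control vector, while activations on retain-set data are anchored to a frozen copy of the model. This is not the repository's actual API; the function name, arguments, and hyperparameter values (`rmu_loss`, `layer`, `alpha`, `steering_coeff`) are illustrative assumptions.

```python
# Illustrative sketch of an RMU-style unlearning loss (hypothetical names,
# not the official implementation). Assumes Hugging Face-style models that
# return hidden states, and pre-tokenized forget/retain batches.
import torch
import torch.nn.functional as F

def rmu_loss(model, frozen_model, forget_batch, retain_batch,
             control_vec, layer=7, alpha=100.0):
    # Push forget-set activations at the chosen layer toward the
    # fixed random control vector.
    h_forget = model(**forget_batch, output_hidden_states=True).hidden_states[layer]
    forget_loss = F.mse_loss(h_forget, control_vec.expand_as(h_forget))

    # Keep retain-set activations close to the frozen reference model,
    # preserving general capabilities.
    h_retain = model(**retain_batch, output_hidden_states=True).hidden_states[layer]
    with torch.no_grad():
        h_ref = frozen_model(**retain_batch, output_hidden_states=True).hidden_states[layer]
    retain_loss = F.mse_loss(h_retain, h_ref)

    return forget_loss + alpha * retain_loss

# The control vector is typically sampled once and held fixed across steps:
# u = torch.rand(hidden_dim); control_vec = steering_coeff * u / u.norm()
```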
Related projects:
- Röttger et al. (2023): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" ☆55 · Updated 8 months ago
- Improving Alignment and Robustness with Circuit Breakers ☆124 · Updated 2 months ago
- Run safety benchmarks against AI models and view detailed reports showing how well they performed. ☆50 · Updated this week
- The official repository of the paper "On the Exploitability of Instruction Tuning". ☆56 · Updated 7 months ago
- Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction". ☆76 · Updated 3 weeks ago
- ICLR 2024 paper showing properties of safety tuning and exaggerated safety. ☆61 · Updated 4 months ago
- Landing page for TOFU. ☆79 · Updated 3 months ago
- Weak-to-Strong Jailbreaking on Large Language Models ☆62 · Updated 6 months ago
- Official repository for "Tamper-Resistant Safeguards for Open-Weight LLMs" ☆34 · Updated 3 weeks ago
- DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models (ICLR 2024) ☆48 · Updated 5 months ago
- AI Logging for Interpretability and Explainability 🔬 ☆74 · Updated 3 months ago
- Parsimonious Concept Engineering (PaCE) uses sparse coding on a large-scale concept dictionary to effectively improve the trustworthiness… ☆25 · Updated 3 months ago
- Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers" ☆70 · Updated 5 months ago
- Starter kit for the Trojan Detection Challenge 2023 (LLM Edition), a NeurIPS 2023 competition. ☆77 · Updated 4 months ago
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity ☆46 · Updated last month
- LLM experiments done during SERI MATS, focusing on activation steering and interpreting activation spaces ☆73 · Updated 11 months ago
- [ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning ☆79 · Updated 3 months ago
- Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives ☆59 · Updated 6 months ago
- A resource repository for representation engineering in large language models ☆36 · Updated last week
- Python package for measuring memorization in LLMs. ☆107 · Updated this week
- [NeurIPS'23] Aging with GRACE: Lifelong Model Editing with Discrete Key-Value Adaptors ☆64 · Updated 6 months ago
- Function Vectors in Large Language Models (ICLR 2024) ☆107 · Updated last month
- Adversarial Attacks on GPT-4 via Simple Random Search [Dec 2023] ☆41 · Updated 4 months ago
- A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use ☆106 · Updated 5 months ago
- An Open Robustness Benchmark for Jailbreaking Language Models [arXiv 2024] ☆169 · Updated last month
- Official implementation of the ICLR'24 paper "Curiosity-driven Red Teaming for Large Language Models" (https://openreview.net/pdf?id=4KqkizX…) ☆57 · Updated 6 months ago