Breakend / SelfDestructingModels ☆12 · Updated last year
Alternatives and similar repositories for SelfDestructingModels:
Users interested in SelfDestructingModels are comparing it to the libraries listed below.
- Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs" ☆42 · Updated 3 months ago
- ☆30 · Updated last month
- Adversarial Attacks on GPT-4 via Simple Random Search [Dec 2023] ☆43 · Updated 8 months ago
- ☆51 · Updated last year
- Code for safety tests in "Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates" ☆17 · Updated 10 months ago
- Code for the paper "Evading Black-box Classifiers Without Breaking Eggs" [SaTML 2024] ☆19 · Updated 9 months ago
- Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e… ☆26 · Updated 7 months ago
- This is an official repository for "LAVA: Data Valuation without Pre-Specified Learning Algorithms" (ICLR 2023). ☆45 · Updated 7 months ago
- WMDP is an LLM proxy benchmark for hazardous knowledge in bio, cyber, and chemical security. We also release code for RMU, an unlearning m… ☆93 · Updated 8 months ago
- Official Repository for the ICML 2023 paper "Can Neural Network Memorization Be Localized?" ☆17 · Updated last year
- Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024. ☆110 · Updated 7 months ago
- Independent robustness evaluation of "Improving Alignment and Robustness with Short Circuiting" ☆13 · Updated 5 months ago
- A library for mechanistic anomaly detection ☆17 · Updated last week
- Landing Page for TOFU ☆107 · Updated last month
- Code for the paper "Universal Jailbreak Backdoors from Poisoned Human Feedback" ☆45 · Updated 8 months ago
- ☆17 · Updated last month
- ☆19 · Updated 5 months ago
- ☆15 · Updated last month
- ☆41 · Updated this week
- A modern look at the relationship between sharpness and generalization [ICML 2023] ☆43 · Updated last year
- Algebraic value editing in pretrained language models ☆62 · Updated last year
- ☆34 · Updated last year
- ☆31 · Updated last year
- Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives ☆66 · Updated 10 months ago
- Sparse Autoencoder Training Library ☆38 · Updated 2 months ago
- ☆14 · Updated 8 months ago
- Spurious Features Everywhere - Large-Scale Detection of Harmful Spurious Features in ImageNet ☆29 · Updated last year
- Code for my NeurIPS 2024 ATTRIB paper titled "Attribution Patching Outperforms Automated Circuit Discovery" ☆27 · Updated 7 months ago
- Röttger et al. (2023): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" ☆77 · Updated last year
- The official repository of the paper "On the Exploitability of Instruction Tuning". ☆58 · Updated 11 months ago