Breakend / SelfDestructingModels
☆12 Updated last year
Related projects
Alternatives and complementary repositories for SelfDestructingModels
- Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs" ☆38 Updated 3 weeks ago
- Adversarial Attacks on GPT-4 via Simple Random Search [Dec 2023] ☆42 Updated 6 months ago
- ☆26 Updated 2 weeks ago
- Independent robustness evaluation of Improving Alignment and Robustness with Short Circuiting ☆12 Updated 3 months ago
- Algebraic value editing in pretrained language models ☆57 Updated last year
- ☆102 Updated last month
- WMDP is a LLM proxy benchmark for hazardous knowledge in bio, cyber, and chemical security. We also release code for RMU, an unlearning m… ☆81 Updated 6 months ago
- Code for my NeurIPS 2024 ATTRIB paper titled "Attribution Patching Outperforms Automated Circuit Discovery" ☆26 Updated 5 months ago
- Code for the paper "Pretraining task diversity and the emergence of non-Bayesian in-context learning for regression" ☆20 Updated last year
- ☆59 Updated 2 years ago
- ☆49 Updated last year
- ☆13 Updated 2 months ago
- Röttger et al. (2023): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" ☆61 Updated 10 months ago
- Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024. ☆107 Updated 5 months ago
- ☆31 Updated last year
- ☆24 Updated 7 months ago
- A modern look at the relationship between sharpness and generalization [ICML 2023] ☆42 Updated last year
- Mechanistic Interpretability for Transformer Models ☆49 Updated 2 years ago
- ☆54 Updated 2 years ago
- ☆32 Updated last year
- ☆43 Updated 4 months ago
- Pytorch Datasets for Easy-To-Hard ☆25 Updated 2 years ago
- Official Repository for ICML 2023 paper "Can Neural Network Memorization Be Localized?" ☆16 Updated last year
- ☆18 Updated last month
- Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e… ☆25 Updated 5 months ago
- ☆22 Updated last year
- Sparse Autoencoder Training Library ☆26 Updated 2 weeks ago
- ☆61 Updated 2 years ago
- Code for safety test in "Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates" ☆17 Updated 8 months ago
- ☆187 Updated last month