Breakend / SelfDestructingModels ☆12 · Updated last year
Alternatives and similar repositories for SelfDestructingModels:
Users interested in SelfDestructingModels are comparing it to the libraries listed below.
- Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs" ☆42 · Updated 3 months ago
- ☆30 · Updated last month
- Adversarial Attacks on GPT-4 via Simple Random Search [Dec 2023] ☆43 · Updated 8 months ago
- ☆51 · Updated last year
- Code for safety tests in "Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates" ☆17 · Updated 10 months ago
- Code for the paper "Evading Black-box Classifiers Without Breaking Eggs" [SaTML 2024] ☆19 · Updated 9 months ago
- Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e… ☆26 · Updated 7 months ago
- This is an official repository for "LAVA: Data Valuation without Pre-Specified Learning Algorithms" (ICLR 2023). ☆45 · Updated 7 months ago
- WMDP is an LLM proxy benchmark for hazardous knowledge in bio, cyber, and chemical security. We also release code for RMU, an unlearning m… ☆93 · Updated 8 months ago
- Official Repository for the ICML 2023 paper "Can Neural Network Memorization Be Localized?" ☆17 · Updated last year
- Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024. ☆110 · Updated 7 months ago
- Independent robustness evaluation of "Improving Alignment and Robustness with Short Circuiting" ☆13 · Updated 5 months ago
- A library for mechanistic anomaly detection ☆17 · Updated last week
- Landing Page for TOFU ☆107 · Updated last month
- Code for the paper "Universal Jailbreak Backdoors from Poisoned Human Feedback" ☆45 · Updated 8 months ago
- ☆17 · Updated last month
- ☆19 · Updated 5 months ago
- ☆15 · Updated last month
- ☆41 · Updated this week
- A modern look at the relationship between sharpness and generalization [ICML 2023] ☆43 · Updated last year
- Algebraic value editing in pretrained language models ☆62 · Updated last year
- ☆34 · Updated last year
- ☆31 · Updated last year
- Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives ☆66 · Updated 10 months ago
- Sparse Autoencoder Training Library ☆38 · Updated 2 months ago
- ☆14 · Updated 8 months ago
- Spurious Features Everywhere - Large-Scale Detection of Harmful Spurious Features in ImageNet ☆29 · Updated last year
- Code for my NeurIPS 2024 ATTRIB paper titled "Attribution Patching Outperforms Automated Circuit Discovery" ☆27 · Updated 7 months ago
- Röttger et al. (2023): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" ☆77 · Updated last year
- The official repository of the paper "On the Exploitability of Instruction Tuning". ☆58 · Updated 11 months ago