Jayfeather1024 / Backdoor-Enhanced-Alignment
☆10Updated 8 months ago
Related projects ⓘ
Alternatives and complementary repositories for Backdoor-Enhanced-Alignment
- [ICLR 2022] Boosting Randomized Smoothing with Variance Reduced Classifiers☆12Updated 2 years ago
- ☆14Updated 6 months ago
- [CVPR 2022] "Quarantine: Sparsity Can Uncover the Trojan Attack Trigger for Free" by Tianlong Chen*, Zhenyu Zhang*, Yihua Zhang*, Shiyu C…☆25Updated 2 years ago
- [ICML 2023] "Robust Weight Signatures: Gaining Robustness as Easy as Patching Weights?" by Ruisi Cai, Zhenyu Zhang, Zhangyang Wang☆15Updated last year
- Code repo of our paper Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis (https://arxiv.org/abs/2406.10794…☆12Updated 3 months ago
- Not All Poisons are Created Equal: Robust Training against Data Poisoning (ICML 2022)☆17Updated 2 years ago
- Code relative to "Adversarial robustness against multiple and single $l_p$-threat models via quick fine-tuning of robust classifiers"☆15Updated last year
- Code for paper "Universal Jailbreak Backdoors from Poisoned Human Feedback"☆41Updated 6 months ago
- Code for safety test in "Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates"☆17Updated 8 months ago
- [NeurIPS 2022] "Randomized Channel Shuffling: Minimal-Overhead Backdoor Attack Detection without Clean Datasets" by Ruisi Cai*, Zhenyu Zh…☆19Updated 2 years ago
- [SafeGenAi @ NeurIPS 2024] Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates☆60Updated 3 weeks ago
- ☆53Updated last year
- ☆38Updated last year
- [NeurIPS'22] Trap and Replace: Defending Backdoor Attacks by Trapping Them into an Easy-to-Replace Subnetwork. Haotao Wang, Junyuan Hong,…☆14Updated 11 months ago
- Code for the paper "BadPrompt: Backdoor Attacks on Continuous Prompts"☆36Updated 4 months ago
- kyleliang919 / Uncovering-the-Connections-BetweenAdversarial-Transferability-and-Knowledge-Transferabilitycode for ICML 2021 paper in which we explore the relationship between adversarial transferability and knowledge transferability.☆17Updated last year
- Official Repository for The Paper: Safety Alignment Should Be Made More Than Just a Few Tokens Deep☆28Updated 4 months ago
- ☆29Updated 2 years ago
- Github repo for One-shot Neural Backdoor Erasing via Adversarial Weight Masking (NeurIPS 2022)☆14Updated last year
- [ICLR 2023, Spotlight] Indiscriminate Poisoning Attacks on Unsupervised Contrastive Learning☆28Updated 11 months ago
- ☆20Updated last year
- ☆13Updated 4 months ago
- [ICLR 2022 official code] Robust Learning Meets Generative Models: Can Proxy Distributions Improve Adversarial Robustness?☆26Updated 2 years ago
- Code for Arxiv When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?☆15Updated 2 months ago
- Backdoor Safety Tuning (NeurIPS 2023 & 2024 Spotlight)☆24Updated this week
- Robustify Black-Box Models (ICLR'22 - Spotlight)☆24Updated last year
- ☆26Updated 3 weeks ago
- Official code for FAccT'21 paper "Fairness Through Robustness: Investigating Robustness Disparity in Deep Learning" https://arxiv.org/abs…☆12Updated 3 years ago
- On the Loss Landscape of Adversarial Training: Identifying Challenges and How to Overcome Them [NeurIPS 2020]☆35Updated 3 years ago
- ☆18Updated 6 months ago