mireshghallah / neighborhood-curvature-mia
☆20 · Updated last year
Related projects
Alternatives and complementary repositories for neighborhood-curvature-mia
- [ICML 2023] "Robust Weight Signatures: Gaining Robustness as Easy as Patching Weights?" by Ruisi Cai, Zhenyu Zhang, Zhangyang Wang ☆15 · Updated last year
- ☆38 · Updated last year
- Official repository for the paper "Safety Alignment Should Be Made More Than Just a Few Tokens Deep" ☆29 · Updated 4 months ago
- Code for the paper "Universal Jailbreak Backdoors from Poisoned Human Feedback" ☆41 · Updated 7 months ago
- ☆40 · Updated last year
- Official code for the paper "Vaccine: Perturbation-aware Alignment for Large Language Models" (NeurIPS 2024) ☆19 · Updated last week
- ☆36 · Updated 4 months ago
- Implementation of the paper "Exploring the Universal Vulnerability of Prompt-based Learning Paradigm" (Findings of NAACL 2022) ☆27 · Updated 2 years ago
- Code for the paper "BadPrompt: Backdoor Attacks on Continuous Prompts" ☆36 · Updated 4 months ago
- ☆16 · Updated 4 months ago
- Codebase for "Decoding Compressed Trust" ☆20 · Updated 6 months ago
- Official repository for the ICML 2024 paper "On Prompt-Driven Safeguarding for Large Language Models" ☆73 · Updated 2 months ago
- ☆13 · Updated last month
- Official code for the paper "Evaluating Copyright Takedown Methods for Language Models" ☆15 · Updated 4 months ago
- [ACL 2024] Code and data for "Machine Unlearning of Pre-trained Large Language Models" ☆47 · Updated last month
- ☆22 · Updated 11 months ago
- ☆6 · Updated 2 years ago
- Code for the Findings of EMNLP 2023 paper "Multi-step Jailbreaking Privacy Attacks on ChatGPT" ☆26 · Updated last year
- ☆10 · Updated 4 years ago
- SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors ☆34 · Updated 4 months ago
- Code for the paper "Rethinking Stealthiness of Backdoor Attack against NLP Models" (ACL-IJCNLP 2021) ☆21 · Updated 2 years ago
- Official repository for "Dataset Inference for LLMs" ☆23 · Updated 3 months ago
- ☆21 · Updated last month
- Code repo of our paper "Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis" (https://arxiv.org/abs/2406.10794) ☆12 · Updated 3 months ago
- Code for the arXiv paper "When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?" ☆15 · Updated this week
- ☆49 · Updated last year
- [EMNLP 2022] Distillation-Resistant Watermarking (DRW) for Model Protection in NLP ☆12 · Updated last year
- Official repo for the paper "Recovering Private Text in Federated Learning of Language Models" (NeurIPS 2022) ☆57 · Updated last year
- A lightweight library for large language model (LLM) jailbreaking defense ☆39 · Updated last month
- Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (NeurIPS 2024) ☆48 · Updated 3 months ago