google-research-datasets / adversarial-nibbler
This dataset contains results from all rounds of Adversarial Nibbler: adversarial prompts fed into public generative text-to-image models, along with validations of the resulting unsafe images. The data is released in two sets: all prompts submitted as unsafe, and all prompts attempted (sent to the text-to-image models but not submitted as unsafe).
☆23 · Updated 5 months ago
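A minimal sketch of loading the two prompt sets is shown below. The file names and the `prompt` column used here (`submitted_prompts.csv`, `attempted_prompts.csv`) are assumptions for illustration only and may not match the repository's actual layout; check the dataset's own documentation for the real file structure.

```python
# Sketch: load the two Adversarial Nibbler prompt sets with pandas.
# NOTE: file names and column names below are hypothetical placeholders,
# not the dataset's documented layout.
import pandas as pd

# Hypothetical file names for the two released sets.
submitted = pd.read_csv("submitted_prompts.csv")   # prompts submitted as unsafe
attempted = pd.read_csv("attempted_prompts.csv")   # prompts sent to the models but not submitted

print(f"{len(submitted)} submitted prompts, {len(attempted)} attempted prompts")

# Inspect a hypothetical 'prompt' column, if the files expose one.
if "prompt" in submitted.columns:
    print(submitted["prompt"].head())
```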
Alternatives and similar repositories for adversarial-nibbler
Users interested in adversarial-nibbler are comparing it to the repositories listed below
- ☆44 · Updated 5 months ago
- The official repository of the paper "On the Exploitability of Instruction Tuning". ☆64 · Updated last year
- The official implementation of our pre-print paper "Automatic and Universal Prompt Injection Attacks against Large Language Models". ☆51 · Updated 8 months ago
- Code for paper "Universal Jailbreak Backdoors from Poisoned Human Feedback" ☆55 · Updated last year
- ☆44 · Updated 2 years ago
- An LLM can Fool Itself: A Prompt-Based Adversarial Attack (ICLR 2024) ☆92 · Updated 5 months ago
- Official Repository for Dataset Inference for LLMs ☆35 · Updated 11 months ago
- PAL: Proxy-Guided Black-Box Attack on Large Language Models ☆51 · Updated 11 months ago
- [ICLR 2025] Dissecting adversarial robustness of multimodal language model agents ☆97 · Updated 4 months ago
- ☆44 · Updated 2 years ago
- Code for "Universal Adversarial Triggers Are Not Universal." ☆17 · Updated last year
- Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (NeurIPS 2024) ☆62 · Updated 6 months ago
- ☆40 · Updated 9 months ago
- ☆21 · Updated 6 months ago
- WMDP is an LLM proxy benchmark for hazardous knowledge in bio, cyber, and chemical security. We also release code for RMU, an unlearning m… ☆128 · Updated last month
- ☆19 · Updated last year
- Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal" (ICLR 2025) ☆56 · Updated 4 months ago
- Official implementation of AdvPrompter https://arxiv.org/abs/2404.16873 ☆157 · Updated last year
- ☆55 · Updated 2 years ago
- [ICML 2024] COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability ☆157 · Updated 7 months ago
- The open-source repository of FuzzLLM ☆27 · Updated last year
- ☆16 · Updated last month
- [ICLR 2024] Provable Robust Watermarking for AI-Generated Text ☆33 · Updated last year
- Official repository for "PostMark: A Robust Blackbox Watermark for Large Language Models" ☆27 · Updated 10 months ago
- [ICLR 2024] Showing properties of safety tuning and exaggerated safety. ☆85 · Updated last year
- AnyDoor: Test-Time Backdoor Attacks on Multimodal Large Language Models ☆55 · Updated last year
- Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks ☆29 · Updated last year
- Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives ☆69 · Updated last year
- [EMNLP 2022] Distillation-Resistant Watermarking (DRW) for Model Protection in NLP ☆13 · Updated last year
- [ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning ☆94 · Updated last year