google-research-datasets / adversarial-nibbler
This dataset contains results from all rounds of Adversarial Nibbler. The data includes adversarial prompts fed into public generative text-to-image (t2i) models and validations for unsafe images. The data is released in two sets: all prompts that were submitted as unsafe, and all prompts that were attempted (sent to the t2i models but not submitted as unsafe).
☆24 · Updated 10 months ago
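For readers who want to explore the two prompt sets, here is a minimal sketch of how they could be loaded and compared, assuming the data is distributed as CSV files. The file names and column names below are hypothetical and should be checked against the repository's actual layout.

```python
# Minimal sketch for exploring the Adversarial Nibbler prompt sets.
# NOTE: file names and column names are hypothetical placeholders;
# check the repository for the actual data layout before running.
import pandas as pd

# The dataset is described as two sets: prompts submitted as unsafe,
# and prompts attempted (sent to the t2i models but not submitted).
submitted = pd.read_csv("submitted_prompts.csv")   # hypothetical file name
attempted = pd.read_csv("attempted_prompts.csv")   # hypothetical file name

print(f"{len(submitted)} submitted prompts, {len(attempted)} attempted prompts")

# Example exploration: how many prompt strings appear in both sets?
# Assumes a 'prompt' text column in each file.
overlap = set(submitted["prompt"]) & set(attempted["prompt"])
print(f"{len(overlap)} prompts appear in both sets")
```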
Alternatives and similar repositories for adversarial-nibbler
Users interested in adversarial-nibbler are comparing it to the repositories listed below.
- ☆48 · Updated 10 months ago
- The official repository of the paper "On the Exploitability of Instruction Tuning". ☆66 · Updated last year
- ☆43 · Updated 2 years ago
- Code&Data for the paper "Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents" [NeurIPS 2024] ☆102 · Updated last year
- The official implementation of our pre-print paper "Automatic and Universal Prompt Injection Attacks against Large Language Models". ☆67 · Updated last year
- Code for paper "Universal Jailbreak Backdoors from Poisoned Human Feedback" ☆66 · Updated last year
- ☆24 · Updated 2 years ago
- ☆57 · Updated last year
- Official Repository for Dataset Inference for LLMs ☆43 · Updated last year
- Official implementation of AdvPrompter https://arxiv.org/abs/2404.16873 ☆172 · Updated last year
- Code for the paper "Be Careful about Poisoned Word Embeddings: Exploring the Vulnerability of the Embedding Layers in NLP Models" (NAACL-… ☆43 · Updated 4 years ago
- PAL: Proxy-Guided Black-Box Attack on Large Language Models ☆55 · Updated last year
- ☆44 · Updated 2 years ago
- TACL 2025: Investigating Adversarial Trigger Transfer in Large Language Models ☆19 · Updated 4 months ago
- [ICLR 2025] Dissecting adversarial robustness of multimodal language model agents ☆120 · Updated 9 months ago
- ☆59 · Updated 2 years ago
- ☆23 · Updated last year
- NeurIPS'24 - LLM Safety Landscape ☆34 · Updated last month
- Code repo for the paper: Attacking Vision-Language Computer Agents via Pop-ups ☆48 · Updated 11 months ago
- ☆20 · Updated 6 months ago
- ☆23 · Updated 11 months ago
- Code for Findings-EMNLP 2023 paper: Multi-step Jailbreaking Privacy Attacks on ChatGPT ☆35 · Updated 2 years ago
- Data for our paper "Defending ChatGPT against Jailbreak Attack via Self-Reminder" ☆20 · Updated 2 years ago
- Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal" (ICLR 2025) ☆72 · Updated 9 months ago
- ☆38 · Updated 2 years ago
- 🤫 Code and benchmark for our ICLR 2024 spotlight paper: "Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Con… ☆49 · Updated last year
- This is the official GitHub repo for our paper: "BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Lang… ☆20 · Updated last year
- Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks ☆32 · Updated last year
- ☆114 · Updated 2 years ago
- [ICLR 2024] Data for "Multilingual Jailbreak Challenges in Large Language Models" ☆96 · Updated last year