google-research-datasets / adversarial-nibbler
This dataset contains results from all rounds of Adversarial Nibbler: adversarial prompts fed into public generative text-to-image models, along with validations of the resulting unsafe images. The data is released in two sets: all prompts submitted as unsafe, and all prompts attempted (sent to the text-to-image models but not submitted as unsafe).
☆23 · Updated 5 months ago
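A minimal sketch of loading the two prompt sets is shown below. The file names and the `prompt` column used here (`submitted_prompts.csv`, `attempted_prompts.csv`) are assumptions for illustration only and may not match the repository's actual layout; check the dataset's own documentation for the real file structure.

```python
# Sketch: load the two Adversarial Nibbler prompt sets with pandas.
# NOTE: file names and column names below are hypothetical placeholders,
# not the dataset's documented layout.
import pandas as pd

# Hypothetical file names for the two released sets.
submitted = pd.read_csv("submitted_prompts.csv")   # prompts submitted as unsafe
attempted = pd.read_csv("attempted_prompts.csv")   # prompts sent to the models but not submitted

print(f"{len(submitted)} submitted prompts, {len(attempted)} attempted prompts")

# Inspect a hypothetical 'prompt' column, if the files expose one.
if "prompt" in submitted.columns:
    print(submitted["prompt"].head())
```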
Alternatives and similar repositories for adversarial-nibbler
Users interested in adversarial-nibbler are comparing it to the repositories listed below
- ☆44 · Updated 5 months ago
- The official repository of the paper "On the Exploitability of Instruction Tuning". ☆64 · Updated last year
- The official implementation of our pre-print paper "Automatic and Universal Prompt Injection Attacks against Large Language Models". ☆51 · Updated 8 months ago
- Code for paper "Universal Jailbreak Backdoors from Poisoned Human Feedback" ☆55 · Updated last year
- ☆44 · Updated 2 years ago
- An LLM can Fool Itself: A Prompt-Based Adversarial Attack (ICLR 2024) ☆92 · Updated 5 months ago
- Official Repository for Dataset Inference for LLMs ☆35 · Updated 11 months ago
- PAL: Proxy-Guided Black-Box Attack on Large Language Models ☆51 · Updated 11 months ago
- [ICLR 2025] Dissecting adversarial robustness of multimodal language model agents ☆97 · Updated 4 months ago
- ☆44 · Updated 2 years ago
- Code for "Universal Adversarial Triggers Are Not Universal." ☆17 · Updated last year
- Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (NeurIPS 2024) ☆62 · Updated 6 months ago
- ☆40 · Updated 9 months ago
- ☆21 · Updated 6 months ago
- WMDP is an LLM proxy benchmark for hazardous knowledge in bio, cyber, and chemical security. We also release code for RMU, an unlearning m… ☆128 · Updated last month
- ☆19 · Updated last year
- Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal" (ICLR 2025) ☆56 · Updated 4 months ago
- Official implementation of AdvPrompter https://arxiv.org/abs/2404.16873 ☆157 · Updated last year
- ☆55 · Updated 2 years ago
- [ICML 2024] COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability ☆157 · Updated 7 months ago
- The open-source repository of FuzzLLM ☆27 · Updated last year
- ☆16 · Updated last month
- [ICLR 2024] Provable Robust Watermarking for AI-Generated Text ☆33 · Updated last year
- Official repository for "PostMark: A Robust Blackbox Watermark for Large Language Models" ☆27 · Updated 10 months ago
- [ICLR 2024] Showing properties of safety tuning and exaggerated safety. ☆85 · Updated last year
- AnyDoor: Test-Time Backdoor Attacks on Multimodal Large Language Models ☆55 · Updated last year
- Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks ☆29 · Updated last year
- Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives ☆69 · Updated last year
- [EMNLP 2022] Distillation-Resistant Watermarking (DRW) for Model Protection in NLP ☆13 · Updated last year
- [ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning ☆94 · Updated last year