qizhangli / Gradient-based-Jailbreak-AttacksLinks

Code for our NeurIPS 2024 paper Improved Generation of Adversarial Examples Against Safety-aligned LLMs

☆12

Alternatives and similar repositories for Gradient-based-Jailbreak-Attacks

Users that are interested in Gradient-based-Jailbreak-Attacks are comparing it to the libraries listed below

Sorting:

NY1024 / BAP-Jailbreak-Vision-Language-Models-via-Bi-Modal-Adversarial-Prompt
☆53Updated last year
Alibaba-AAIG / Oyster
The Oyster series is a set of safety models developed in-house by Alibaba-AAIG, devoted to building a responsible AI ecosystem. | Oyster …
☆56Updated 2 months ago
AISafety-HKUST / Backdoor_Safety_Tuning
Backdoor Safety Tuning (NeurIPS 2023 & 2024 Spotlight)
☆27Updated last year
thu-ml / Attack-Bard
☆107Updated last year
euanong / image-hijacks
Official codebase for Image Hijacks: Adversarial Images can Control Generative Models at Runtime
☆53Updated 2 years ago
weizeming / momentum-attack-llm
☆23Updated 10 months ago
roywang021 / UMK
Code for ACM MM2024 paper: White-box Multimodal Jailbreaks Against Large Vision-Language Models
☆30Updated 11 months ago
weizeming / ICML-2024-SAM-AT
☆24Updated last year
lapisrocks / rpo
Official repository for "Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks"
☆59Updated last year
bboylyg / RNP
Reconstructive Neuron Pruning for Backdoor Defense (ICML 2023)
☆39Updated last year
abc03570128 / Jailbreaking-Attack-against-Multimodal-Large-Language-Model
☆52Updated last year
ethz-spylab / rlhf-poisoning
Code for paper "Universal Jailbreak Backdoors from Poisoned Human Feedback"
☆62Updated last year
Bai-YT / AdaptiveSmoothing
Implementation of the paper "Improving the Accuracy-Robustness Trade-off of Classifiers via Adaptive Smoothing".
☆10Updated last year
Unispac / Fight-Poison-With-Poison
Code repository for the paper --- [USENIX Security 2023] Towards A Proactive ML Approach for Detecting Backdoor Poison Samples
☆30Updated 2 years ago
Jayfeather1024 / Backdoor-Enhanced-Alignment
☆23Updated 11 months ago
PKU-ML / PAT
Code for NeurIPS 2024 Paper "Fight Back Against Jailbreaking via Prompt Adversarial Tuning"
☆21Updated 6 months ago
SchwinnL / LLM_Embedding_Attack
Code to conduct an embedding attack on LLMs
☆28Updated 10 months ago
XuanChen-xc / RLbreaker
Code for "When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search" (NeurIPS 2024)
☆14Updated last year
yuplin2333 / representation-space-jailbreak
Code repo of our paper Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis (https://arxiv.org/abs/2406.10794…
☆22Updated last year
ethz-spylab / autoadvexbench
☆33Updated 6 months ago
byerose / Awesome-Foundation-Model-Security
A curated list of trustworthy Generative AI papers. Daily updating...
☆75Updated last year
IBM / SafeLoRA
Github repo for NeurIPS 2024 paper "Safe LoRA: the Silver Lining of Reducing Safety Risks when Fine-tuning Large Language Models"
☆23Updated 2 months ago
BrachioLab / adversarial_prompting
☆53Updated 2 years ago
YiZeng623 / frequency-backdoor
ICCV 2021, We find most existing triggers of backdoor attacks in deep learning contain severe artifacts in the frequency domain. This Rep…
☆45Updated 3 years ago
rotaryhammer / code-autodan
An unofficial implementation of AutoDAN attack on LLMs (arXiv:2310.15140)
☆44Updated last year
RUCAIBox / HADES
[ECCV'24 Oral] The official GitHub page for ''Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking …
☆33Updated last year
thunxxx / MLLM-Jailbreak-evaluation-MMJ-Bench
☆66Updated 8 months ago
RylanSchaeffer / AstraFellowship-When-Do-VLM-Image-Jailbreaks-Transfer
Code for ICLR 2025 Failures to Find Transferable Image Jailbreaks Between Vision-Language Models
☆33Updated 5 months ago
KuofengGao / Verbose_Images
[ICLR 2024] Inducing High Energy-Latency of Large Vision-Language Models with Verbose Images
☆40Updated last year
HKUST-KnowComp / LLM-Multistep-Jailbreak
Code for Findings-EMNLP 2023 paper: Multi-step Jailbreaking Privacy Attacks on ChatGPT
☆35Updated 2 years ago