Aegis1863 / xJailbreak

Code for the paper "xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking"

☆16

Alternatives and similar repositories for xJailbreak

Users who are interested in xJailbreak are comparing it to the repositories listed below.

AI45Lab / X-Boundary
The code repo of paper "X-Boundary: Establishing Exact Safety Boundary to Shield LLMs from Multi-Turn Jailbreaks without Compromising Usa…
☆37 Updated last month
AI45Lab / ActorAttack
☆118 Updated 11 months ago
facebookresearch / multimodal-fusion-jailbreaks
Official repository for the paper "Gradient-based Jailbreak Images for Multimodal Fusion Models" (https://arxiv.org/abs/2410.03489)
☆19 Updated last year
CryptoAILab / FigStep
[AAAI'25 (Oral)] Jailbreaking Large Vision-language Models via Typographic Visual Prompts
☆187 Updated 6 months ago
xirui-li / DrAttack
Official implementation of paper: DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers
☆66 Updated last year
NY1024 / Jailbreak_GPT4o
☆26 Updated last year
thunxxx / MLLM-Jailbreak-evaluation-MMJ-Bench
☆68 Updated 9 months ago
Vinsonzyh / BlueSuffix
[ICLR 2025] BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks
☆30 Updated 2 months ago
theshi-1128 / jailbreak-bench
The most comprehensive and accurate LLM jailbreak attack benchmark by far
☆21 Updated 9 months ago
tmlr-group / DeepInception
[arXiv:2311.03191] "DeepInception: Hypnotize Large Language Model to Be Jailbreaker"
☆166 Updated last year
niconi19 / LLM-Conversation-Safety
[NAACL2024] Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey
☆109 Updated last year
uw-nsl / ArtPrompt
[ACL24] Official Repo of Paper `ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs`
☆90 Updated 4 months ago
Allen-piexl / JailbreakZoo
☆159 Updated last year
chen37058 / Red-Team-Arxiv-Paper-Update
Awesome jailbreak and red-teaming arXiv papers (automatically updated every 12 hours)
☆81 Updated last week
NJUNLP / ReNeLLM
The official implementation of our NAACL 2024 paper "A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Lang…
☆150 Updated 4 months ago
ydyjya / LLM-IHS-Explanation
☆55 Updated last year
Helloworld10011 / Adversarial-Reasoning
A new algorithm that formulates jailbreaking as a reasoning problem.
☆26 Updated 6 months ago
yjw1029 / Self-Reminder
Code for our paper "Defending ChatGPT against Jailbreak Attack via Self-Reminder" in NMI.
☆55 Updated 2 years ago
lapisrocks / rpo
Official repository for "Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks"
☆60 Updated last year
chujiezheng / LLM-Safeguard
Official repository for ICML 2024 paper "On Prompt-Driven Safeguarding for Large Language Models"
☆103 Updated 7 months ago
isXinLiu / MM-SafetyBench
Accepted by ECCV 2024
☆179 Updated last year
SheltonLiu-N / AutoDAN
[ICLR 2024] The official implementation of our ICLR2024 paper "AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language M…
☆419 Updated 11 months ago
Princeton-SysML / Jailbreak_LLM
☆190 Updated 2 years ago
ydyjya / SafetyHeadAttribution
☆61 Updated 7 months ago
LLM-Tuning-Safety / LLMs-Finetuning-Safety
We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20…
☆337 Updated last year
mbzuai-nlp / AudioJailbreak
Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models
☆27 Updated 3 months ago
git-disl / awesome_LLM-harmful-fine-tuning-papers
A survey on harmful fine-tuning attack for large language model
☆229 Updated last week
JailbreakBench / jailbreakbench
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Language Models [NeurIPS 2024 Datasets and Benchmarks Track]
☆506 Updated 9 months ago
wonderNefelibata / Awesome-LRM-Safety
Awesome Large Reasoning Model (LRM) Safety. This repository is used to collect security-related research on large reasoning models such as …
☆79 Updated last week
isXinLiu / Awesome-MLLM-Safety
Accepted by IJCAI-24 Survey Track
☆226 Updated last year