kztakemoto / simbaja

All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks

☆18

Alternatives and similar repositories for simbaja

Users who are interested in simbaja are comparing it to the repositories listed below.

facebookresearch / jailbreak-objectives
Code and data to go with the Zhu et al. paper "An Objective for Nuanced LLM Jailbreaks"
☆34Updated 10 months ago
lapisrocks / rpo
Official repository for "Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks"
☆58Updated last year
RylanSchaeffer / AstraFellowship-When-Do-VLM-Image-Jailbreaks-Transfer
Code for ICLR 2025 Failures to Find Transferable Image Jailbreaks Between Vision-Language Models
☆32Updated 4 months ago
rotaryhammer / code-autodan
An unofficial implementation of AutoDAN attack on LLMs (arXiv:2310.15140)
☆44Updated last year
ethz-spylab / rlhf-poisoning
Code for paper "Universal Jailbreak Backdoors from Poisoned Human Feedback"
☆61Updated last year
sail-sg / I-FSJ
Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (NeurIPS 2024)
☆65Updated 9 months ago
yuplin2333 / representation-space-jailbreak
Code repo of our paper Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis (https://arxiv.org/abs/2406.10794…
☆22Updated last year
Princeton-SysML / Jailbreak_LLM
☆185Updated last year
SolidShen / RIPPLE_official
☆20Updated last year
centerforaisafety / tdc2023-starter-kit
This is the starter kit for the Trojan Detection Challenge 2023 (LLM Edition), a NeurIPS 2023 competition.
☆89Updated last year
OSU-NLP-Group / AmpleGCG
AmpleGCG: Learning a Universal and Transferable Generator of Adversarial Attacks on Both Open and Closed LLMs
☆74Updated 11 months ago
AI45Lab / CodeAttack
[ACL 2024] CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion
☆53Updated 3 weeks ago
thunxxx / MLLM-Jailbreak-evaluation-MMJ-Bench
☆63Updated 6 months ago
aengusl / latent-adversarial-training
☆43Updated last year
SheltonLiu-N / Universal-Prompt-Injection
The official implementation of our pre-print paper "Automatic and Universal Prompt Injection Attacks against Large Language Models".
☆60Updated last year
XuandongZhao / weak-to-strong
[ICML 2025] Weak-to-Strong Jailbreaking on Large Language Models
☆86Updated 5 months ago
ethz-spylab / autoadvexbench
☆33Updated 5 months ago
facebookresearch / advprompter
Official implementation of AdvPrompter (https://arxiv.org/abs/2404.16873)
☆168Updated last year
Lyz1213 / BadEdit
☆36Updated last year
Jayfeather1024 / Backdoor-Enhanced-Alignment
☆23Updated 10 months ago
facebookresearch / SecAlign
Repo for the research paper "SecAlign: Defending Against Prompt Injection with Preference Optimization"
☆72Updated 3 months ago
xirui-li / DrAttack
Official implementation of paper: DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers
☆64Updated last year
swj0419 / muse_bench
☆28Updated 7 months ago
SORRY-Bench / sorry-bench
Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal" (ICLR 2025)
☆62Updated 7 months ago
AI45Lab / ActorAttack
☆108Updated 8 months ago
Yu-Fangxu / COLD-Attack
[ICML 2024] COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability
☆166Updated 10 months ago
princeton-polaris-lab / Evaluating-Durable-Safeguards
[ICLR 2025] On Evaluating the Durability of Safeguards for Open-Weight LLMs
☆13Updated 4 months ago
lancopku / agent-backdoor-attacks
Code&Data for the paper "Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents" [NeurIPS 2024]
☆93Updated last year
rishub-tamirisa / tamper-resistance
[ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs"
☆62Updated 4 months ago
facebookresearch / multimodal-fusion-jailbreaks
Official repository for the paper "Gradient-based Jailbreak Images for Multimodal Fusion Models" (https://arxiv.org/abs/2410.03489)
☆19Updated last year