WhileBug/AwesomeLLMJailBreakPapers

WhileBug / AwesomeLLMJailBreakPapers

Awesome LLM Jailbreak academic papers

☆166

Alternatives and similar repositories for AwesomeLLMJailBreakPapers

Users who are interested in AwesomeLLMJailBreakPapers are comparing it to the repositories listed below.

patrickrchao / JailbreakingLLMs
☆751 · Jul 2, 2025 · Updated last year
PKU-ML / PAT
Code for NeurIPS 2024 Paper "Fight Back Against Jailbreaking via Prompt Adversarial Tuning"
☆22 · May 6, 2025 · Updated last year
SproutNan / AI-Safety_Benchmark
The official repository for guided jailbreak benchmark
☆30 · Jul 28, 2025 · Updated 11 months ago
Princeton-SysML / Jailbreak_LLM
☆201 · Nov 26, 2023 · Updated 2 years ago
Yu-Fangxu / COLD-Attack
[ICML 2024] COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability
☆176 · Dec 18, 2024 · Updated last year
jam3scampbell / llama-lying
Code for our paper "Localizing Lying in Llama"
☆15 · Apr 24, 2025 · Updated last year
RICommunity / TAP
TAP: An automated jailbreaking method for black-box LLMs
☆238 · Dec 10, 2024 · Updated last year
YihanWang617 / llm-jailbreaking-defense
A lightweight library for large language model (LLM) jailbreaking defense.
☆61 · Sep 11, 2025 · Updated 9 months ago
arobey1 / smooth-llm
☆135 · Nov 13, 2023 · Updated 2 years ago
AAAAAAsuka / llm_defends
Code for the paper "Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM"
☆14 · Nov 17, 2023 · Updated 2 years ago
GodXuxilie / PromptAttack
An LLM can Fool Itself: A Prompt-Based Adversarial Attack (ICLR 2024)
☆115 · Jan 21, 2025 · Updated last year
usail-hkust / JailTrickBench
Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs. Empirical tricks for LLM Jailbreaking. (NeurIPS 2024)
☆164 · Nov 30, 2024 · Updated last year
JailbreakBench / jailbreakbench
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Language Models [NeurIPS 2024 Datasets and Benchmarks Track]
☆622 · Apr 4, 2025 · Updated last year
tml-epfl / llm-adaptive-attacks
Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks [ICLR 2025]
☆388 · Jan 23, 2025 · Updated last year
TrustAIRLab / HateBench
[USENIX'25] HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns
☆14 · Mar 1, 2025 · Updated last year
RUCAIBox / HADES
[ECCV'24 Oral] The official GitHub page for "Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking …
☆39 · Oct 23, 2024 · Updated last year
ydyjya / Awesome-LLM-Safety
A curated list of safety-related papers, articles, and resources focused on Large Language Models (LLMs). This repository aims to provide…
☆1,877 · Updated this week
yueliu1999 / FlipAttack
[ICML 2025] An official source code for paper "FlipAttack: Jailbreak LLMs via Flipping".
☆175 · May 2, 2025 · Updated last year
TrustAIRLab / VoiceJailbreakAttack
Code for Voice Jailbreak Attacks Against GPT-4o.
☆38 · May 31, 2024 · Updated 2 years ago
OpenBMB / CPO
☆29 · Jul 16, 2024 · Updated last year
LLM-Tuning-Safety / LLMs-Finetuning-Safety
We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20…
☆356 · Feb 23, 2024 · Updated 2 years ago
SheltonLiu-N / AutoDAN
[ICLR 2024] The official implementation of our ICLR2024 paper "AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language M…
☆448 · Jan 22, 2025 · Updated last year
Allen-piexl / JailbreakZoo
☆170 · Sep 2, 2024 · Updated last year
centerforaisafety / HarmBench
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
☆997 · Aug 16, 2024 · Updated last year
ethz-spylab / rlhf_trojan_competition
Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024.
☆118 · Jun 13, 2024 · Updated 2 years ago
TeamPigeonLab / CS-DJ
Accepted by CVPR 2025 (highlight)
☆25 · Jun 8, 2025 · Updated last year
JuliusHenke / autopentest
CLI enabling more autonomous black-box penetration tests using Large Language Models (LLMs)
☆55 · Jul 1, 2026 · Updated last week
microsoft / BIPIA
A benchmark for evaluating the robustness of LLMs and defenses to indirect prompt injection attacks.
☆142 · Apr 15, 2024 · Updated 2 years ago
pipilurj / MLLM-protector
The official repository for paper "MLLM-Protector: Ensuring MLLM’s Safety without Hurting Performance"
☆46 · Apr 21, 2024 · Updated 2 years ago
yangarbiter / adversarial-nonparametrics
Robustness for Non-Parametric Classification: A Generic Attack and Defense
☆18 · Nov 21, 2022 · Updated 3 years ago
DAMO-NLP-SG / multilingual-safety-for-LLMs
[ICLR 2024] Data for "Multilingual Jailbreak Challenges in Large Language Models"
☆105 · Mar 7, 2024 · Updated 2 years ago
yflyl613 / FedRec
[AAAI 2023] Official PyTorch implementation for "Untargeted Attack against Federated Recommendation Systems via Poisonous Item Embeddings…
☆27 · Jan 18, 2023 · Updated 3 years ago
GraySwanAI / circuit-breakers
Improving Alignment and Robustness with Circuit Breakers
☆264 · Sep 24, 2024 · Updated last year
yjw1029 / UA-FedRec
The Python implementation of our "UA-FedRec: Untargeted Attack on Federated News Recommendation" in KDD 2023.
☆20 · Aug 2, 2022 · Updated 3 years ago
qingyu-qc / gpt_bionlp_benchmark
☆25 · Jan 15, 2024 · Updated 2 years ago
corca-ai / awesome-llm-security
A curation of awesome tools, documents and projects about LLM Security.
☆1,622 · Aug 20, 2025 · Updated 10 months ago
yjhuangcd / local-lipschitz
Official implementation for Training Certifiably Robust Neural Networks with Efficient Local Lipschitz Bounds (NeurIPS, 2021).
☆25 · Sep 4, 2022 · Updated 3 years ago
Breakend / SelfDestructingModels
☆14 · Aug 9, 2023 · Updated 2 years ago
tmlr-group / DeepInception
[arXiv:2311.03191] "DeepInception: Hypnotize Large Language Model to Be Jailbreaker"
☆175 · Feb 20, 2024 · Updated 2 years ago