YihanWang617/LLM-Jailbreaking-Defense-Backtranslation

YihanWang617 / LLM-Jailbreaking-Defense-Backtranslation

Code for paper "Defending against LLM Jailbreaking via Backtranslation"

☆34

Alternatives and similar repositories for LLM-Jailbreaking-Defense-Backtranslation

Users that are interested in LLM-Jailbreaking-Defense-Backtranslation are comparing it to the libraries listed below.

YihanWang617 / llm-jailbreaking-defense
A lightweight library for large language model (LLM) jailbreaking defense.
☆61 · Sep 11, 2025 · Updated 10 months ago
UCSB-NLP-Chang / SemanticSmooth
Implementation of paper 'Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing'
☆24 · Jun 9, 2024 · Updated 2 years ago
AI45Lab / CodeAttack
[ACL 2024] CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion
☆62 · Oct 1, 2025 · Updated 9 months ago
Princeton-SysML / Jailbreak_LLM
☆203 · Nov 26, 2023 · Updated 2 years ago
theshi-1128 / jailbreak-bench
The most comprehensive and accurate LLM jailbreak attack benchmark by far
☆21 · Mar 22, 2025 · Updated last year
alphadl / SafeLLM_with_IntentionAnalysis
Towards Safe LLM with our simple-yet-highly-effective Intention Analysis Prompting
☆21 · Mar 25, 2024 · Updated 2 years ago
uw-nsl / SafeDecoding
Official Repository for ACL 2024 Paper SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
☆154 · Jul 19, 2024 · Updated 2 years ago
PKU-ML / PAT
Code for NeurIPS 2024 Paper "Fight Back Against Jailbreaking via Prompt Adversarial Tuning"
☆22 · May 6, 2025 · Updated last year
poloclub / llm-self-defense
LLM Self Defense: By Self Examination, LLMs know they are being tricked
☆52 · May 21, 2024 · Updated 2 years ago
kriti-hippo / red_queen
Red Queen Dataset and data generation template
☆27 · Dec 26, 2025 · Updated 7 months ago
Aatrox103 / SAP
☆49 · May 9, 2024 · Updated 2 years ago
sail-sg / I-FSJ
Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (NeurIPS 2024)
☆65 · Jan 11, 2025 · Updated last year
EasyJailbreak / EasyJailbreak
An easy-to-use Python framework to generate adversarial jailbreak prompts.
☆876 · Mar 30, 2026 · Updated 3 months ago
SheltonLiu-N / Universal-Prompt-Injection
The official implementation of our pre-print paper "Automatic and Universal Prompt Injection Attacks against Large Language Models".
☆73 · Oct 23, 2024 · Updated last year
tmllab / 2025_ICLR_PiF
☆40 · May 17, 2025 · Updated last year
shizhouxing / LLM-Detector-Robustness
[TACL] Code for "Red Teaming Language Model Detectors with Language Models"
☆24 · Nov 24, 2023 · Updated 2 years ago
wagner-group / prompt-injection-defense
Fine-tuning base models to build robust task-specific models
☆36 · Apr 11, 2024 · Updated 2 years ago
STAIR-BUPT / STAIR-LLMGuardrails
☆12 · Sep 29, 2024 · Updated last year
SheltonLiu-N / AutoDAN
[ICLR 2024] The official implementation of our ICLR2024 paper "AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language M…
☆453 · Jan 22, 2025 · Updated last year
uw-nsl / safechain
[ACL 25] SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities
☆30 · Apr 2, 2025 · Updated last year
JailbreakBench / jailbreakbench
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Language Models [NeurIPS 2024 Datasets and Benchmarks Track]
☆638 · Apr 4, 2025 · Updated last year
jxnl / mit-lecture
☆10 · Feb 25, 2025 · Updated last year
arobey1 / smooth-llm
☆135 · Nov 13, 2023 · Updated 2 years ago
DAMO-NLP-SG / multilingual-safety-for-LLMs
[ICLR 2024] Data for "Multilingual Jailbreak Challenges in Large Language Models"
☆106 · Mar 7, 2024 · Updated 2 years ago
ZHZisZZ / emulated-disalignment
[ACL'24, Outstanding Paper] Emulated Disalignment: Safety Alignment for Large Language Models May Backfire!
☆39 · Aug 2, 2024 · Updated last year
qizhangli / Gradient-based-Jailbreak-Attacks
Code for our NeurIPS 2024 paper Improved Generation of Adversarial Examples Against Safety-aligned LLMs
☆12 · Nov 7, 2024 · Updated last year
lapisrocks / rpo
Official repository for "Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks"
☆62 · Aug 8, 2024 · Updated last year
ltroin / llm_attack_defense_arena
☆86 · Sep 5, 2025 · Updated 10 months ago
declare-lab / safety-arithmetic
☆13 · Jan 14, 2025 · Updated last year
ydyjya / SafetyHeadAttribution
☆70 · Jun 1, 2025 · Updated last year
XHMY / AutoDefense
AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks
☆68 · Jan 15, 2026 · Updated 6 months ago
vfleaking / PTST
Code for safety test in "Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates"
☆22 · Sep 21, 2025 · Updated 10 months ago
peldszus / arg-microtexts-multilayer
Argumentative microtexts annotated with RST, SDRT and argumentation structure
☆12 · Jun 19, 2016 · Updated 10 years ago
thu-coai / JailbreakDefense_GoalPriority
[ACL 2024] Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization
☆29 · Jul 9, 2024 · Updated 2 years ago
wuch15 / HiTransformer
ACL 2021: HiTransformer
☆13 · May 29, 2021 · Updated 5 years ago
DanielSc4 / Dynamic-Activation-Composition
Materials for "Multi-property Steering of Large Language Models with Dynamic Activation Composition"
☆14 · Nov 22, 2024 · Updated last year
patrickrchao / JailbreakingLLMs
☆757 · Jul 2, 2025 · Updated last year
wslong20 / G-safeguard
☆42 · Jun 28, 2025 · Updated last year
zyxnlp / ICL-Interpretation-Analysis-Resources
Links to publications that focus on the interpretation and analysis of in-context learning
☆14 · Oct 17, 2024 · Updated last year