lapisrocks/rpo

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/lapisrocks/rpo)

lapisrocks / rpo

Official repository for "Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks"

☆62

Alternatives and similar repositories for rpo

Users that are interested in rpo are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

lapisrocks / DiscreteAdversarialDistillation
View on GitHub
[NeurIPS 2023] Official repository for "Distilling Out-of-Distribution Robustness from Vision-Language Foundation Models"
☆11Jun 18, 2024Updated 2 years ago
yjw1029 / Self-Reminder
View on GitHub
Code for our paper "Defending ChatGPT against Jailbreak Attack via Self-Reminder" in NMI.
☆57Nov 13, 2023Updated 2 years ago
chujiezheng / LLM-Safeguard
View on GitHub
Official repository for ICML 2024 paper "On Prompt-Driven Safeguarding for Large Language Models"
☆108May 20, 2025Updated last year
PKU-ML / PAT
View on GitHub
Code for NeurIPS 2024 Paper "Fight Back Against Jailbreaking via Prompt Adversarial Tuning"
☆22May 6, 2025Updated last year
ys-zong / VLGuard
View on GitHub
[ICML 2024] Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models.
☆90Jan 19, 2025Updated last year
1-Click AI Models by DigitalOcean Gradient • Ad
Deploy popular AI models on DigitalOcean Gradient GPU virtual machines with just a single click. Zero configuration with optimized deployments.
boyiwei / alignment-attribution-code
View on GitHub
[ICML 2024] Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
☆91Mar 30, 2025Updated last year
MiracleHH / nas_privacy
View on GitHub
Official Code Implementation for the CCS 2022 Paper "On the Privacy Risks of Cell-Based NAS Architectures"
☆11Nov 21, 2022Updated 3 years ago
facebookresearch / jailbreak-objectives
View on GitHub
Code and data to go with the Zhu et al. paper "An Objective for Nuanced LLM Jailbreaks"
☆37Jul 2, 2026Updated 2 weeks ago
Princeton-SysML / Jailbreak_LLM
View on GitHub
☆203Nov 26, 2023Updated 2 years ago
IntologyAI / NanoGPT-Bench
View on GitHub
☆21Jul 3, 2026Updated 2 weeks ago
facebookresearch / SecAlign
View on GitHub
Repo for the research paper "SecAlign: Defending Against Prompt Injection with Preference Optimization"
☆98Jul 2, 2026Updated 2 weeks ago
lancopku / agent-backdoor-attacks
View on GitHub
Code&Data for the paper "Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents" [NeurIPS 2024]
☆115Sep 27, 2024Updated last year
Jinxiaolong1129 / Foot-in-the-door-Jailbreak
View on GitHub
☆23May 14, 2025Updated last year
tmlr-group / DeepInception
View on GitHub
[arXiv:2311.03191] "DeepInception: Hypnotize Large Language Model to Be Jailbreaker"
☆176Feb 20, 2024Updated 2 years ago
Managed Database hosting by DigitalOcean • Ad
PostgreSQL, MySQL, MongoDB, Kafka, Valkey, and OpenSearch available. Automatically scale up storage and focus on building your apps.
LLM-Tuning-Safety / LLMs-Finetuning-Safety
View on GitHub
We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20…
☆358Feb 23, 2024Updated 2 years ago
weizeming / momentum-attack-llm
View on GitHub
☆25Jan 17, 2025Updated last year
XuandongZhao / weak-to-strong
View on GitHub
[ICML 2025] Weak-to-Strong Jailbreaking on Large Language Models
☆90May 2, 2025Updated last year
clearloveclearlove / BEAT
View on GitHub
☆15Feb 26, 2025Updated last year
sail-sg / Agent-Smith
View on GitHub
[ICML 2024] Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast
☆123Mar 26, 2024Updated 2 years ago
GraySwanAI / circuit-breakers
View on GitHub
Improving Alignment and Robustness with Circuit Breakers
☆266Sep 24, 2024Updated last year
lapisrocks / fedselect
View on GitHub
[CVPR 2024] Official Repository for "FedSelect: Personalized Federated Learning with Customized Selection of Parameters for Fine-Tuning"
☆35Nov 4, 2024Updated last year
DPamK / BadAgent
View on GitHub
☆33Feb 27, 2025Updated last year
stanford-crfm / air-bench-2024
View on GitHub
AIR-Bench 2024 is a safety benchmark that aligns with emerging government regulations and company policies
☆30Aug 14, 2024Updated last year
Managed hosting for WordPress and PHP on Cloudways • Ad
Managed hosting for WordPress, Magento, Laravel, or PHP apps, on multiple cloud providers. Deploy in minutes on Cloudways by DigitalOcean.
LLM-DRA / DRA
View on GitHub
[USENIX Security'24] Official repository of "Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise a…
☆116Oct 11, 2024Updated last year
arobey1 / smooth-llm
View on GitHub
☆135Nov 13, 2023Updated 2 years ago
rishub-tamirisa / tamper-resistance
View on GitHub
[ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs"
☆68Jun 9, 2025Updated last year
uiuc-kang-lab / AdaptiveAttackAgent
View on GitHub
☆38Mar 12, 2025Updated last year
locuslab / breaking-poisoned-classifier
View on GitHub
Code for paper "Poisoned classifiers are not only backdoored, they are fundamentally broken"
☆26Jan 7, 2022Updated 4 years ago
YiZeng623 / I-BAU
View on GitHub
Official Implementation of ICLR 2022 paper, ``Adversarial Unlearning of Backdoors via Implicit Hypergradient''
☆53Nov 16, 2022Updated 3 years ago
RPC2 / AutoInject
View on GitHub
☆20Jun 12, 2026Updated last month
Jayfeather1024 / Backdoor-Enhanced-Alignment
View on GitHub
☆24Dec 8, 2024Updated last year
ChenWu98 / agent-attack
View on GitHub
[ICLR 2025] Dissecting adversarial robustness of multimodal language model agents
☆139Feb 19, 2025Updated last year
Wordpress hosting with auto-scaling - Free Trial Offer • Ad
Fully Managed hosting for WordPress and WooCommerce businesses that need reliable, auto-scalable performance. Cloudways SafeUpdates now available.
xirui-li / DrAttack
View on GitHub
Official implementation of paper: DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers
☆68Aug 25, 2024Updated last year
RylanSchaeffer / AstraFellowship-When-Do-VLM-Image-Jailbreaks-Transfer
View on GitHub
Code for ICLR 2025 Failures to Find Transferable Image Jailbreaks Between Vision-Language Models
☆37Jun 1, 2025Updated last year
Yu-Fangxu / COLD-Attack
View on GitHub
[ICML 2024] COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability
☆176Dec 18, 2024Updated last year
yjw1029 / Self-Reminder-Data
View on GitHub
Data for our paper "Defending ChatGPT against Jailbreak Attack via Self-Reminder"
☆20Oct 26, 2023Updated 2 years ago
HKUST-KnowComp / LLM-Multistep-Jailbreak
View on GitHub
Code for Findings-EMNLP 2023 paper: Multi-step Jailbreaking Privacy Attacks on ChatGPT
☆37Oct 15, 2023Updated 2 years ago
vfleaking / PTST
View on GitHub
Code for safety test in "Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates"
☆22Sep 21, 2025Updated 10 months ago
pasquini-dario / LLM_NeuralExec
View on GitHub
Code to generate NeuralExecs (prompt injection for LLMs)
☆27Oct 5, 2025Updated 9 months ago