XuanChen-xc/RLbreaker

Code for "When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search" (NeurIPS 2024)

☆18

Alternatives and similar repositories for RLbreaker

Users who are interested in RLbreaker compare it to the repositories listed below.

alkaet / LobotoMl
LobotoMl is a set of scripts and tools to assess production deployments of ML services
☆10 · May 16, 2022 · Updated 4 years ago
yuki-younai / MTSA
Official implementation of MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming
☆17 · Jun 2, 2025 · Updated last year
abc03570128 / Jailbreaking-Attack-against-Multimodal-Large-Language-Model
☆63 · Aug 11, 2024 · Updated last year
ml-postech / selective-generation
☆11 · Dec 8, 2024 · Updated last year
uwFengyuan / OCC-CLIP
☆14 · Jan 4, 2025 · Updated last year
HKUST-KnowComp / LLM-Multistep-Jailbreak
Code for Findings-EMNLP 2023 paper: Multi-step Jailbreaking Privacy Attacks on ChatGPT
☆37 · Oct 15, 2023 · Updated 2 years ago
sccsok / CoprGuard
[CVPR 2025] Harnessing Frequency Spectrum Insights for Image Copyright Protection Against Diffusion Models
☆14 · Sep 16, 2025 · Updated 10 months ago
SaFo-Lab / AutoDAN-Turbo
[ICLR 2025 Spotlight] The official implementation of our ICLR2025 paper "AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to…
☆383 · Oct 8, 2025 · Updated 9 months ago
kriti-hippo / red_queen
Red Queen Dataset and data generation template
☆27 · Dec 26, 2025 · Updated 7 months ago
qizhangli / Gradient-based-Jailbreak-Attacks
Code for our NeurIPS 2024 paper "Improved Generation of Adversarial Examples Against Safety-aligned LLMs"
☆12 · Nov 7, 2024 · Updated last year
Bai-YT / AdaptiveSmoothing
Implementation of the paper "Improving the Accuracy-Robustness Trade-off of Classifiers via Adaptive Smoothing".
☆10 · Feb 6, 2024 · Updated 2 years ago
fmarcinek / LICNN
Lateral Inhibition-Inspired Convolutional Neural Network for Visual Attention and Saliency Detection
☆13 · Nov 6, 2020 · Updated 5 years ago
yiksiu-chan / SpeakEasy
[ICML 2025] Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions
☆15 · Mar 7, 2026 · Updated 4 months ago
ole-knf / A-bidirectional-GPT-approach-for-detecting-malicious-network-traffic
This intrusion detection approach uses two GPT models, which are trained on normal network traffic, to predict sequences of communicat…
☆11 · Oct 3, 2023 · Updated 2 years ago
Kasraarabi / Hidden-in-the-Noise
[ICLR 2025] Official implementation of 'Hidden in the Noise: Two-Stage Robust Watermarking for Images'
☆13 · May 5, 2025 · Updated last year
jiaxiaojunQAQ / OmniSafeBench-MM
A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack–Defense Evaluation
☆75 · May 8, 2026 · Updated 2 months ago
git-disl / Vaccine
This is the official code for the paper "Vaccine: Perturbation-aware Alignment for Large Language Models" (NeurIPS 2024)
☆51 · Jan 15, 2026 · Updated 6 months ago
solitude-alive / AwesomeWatermarking
Watermarking papers
☆18 · Mar 31, 2026 · Updated 3 months ago
chuhac / Reasoning-to-Defend
[EMNLP 2025] Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking
☆12 · Aug 22, 2025 · Updated 11 months ago
jfairoze / publicly-detectable-watermark
☆15 · Jan 21, 2025 · Updated last year
SheltonLiu-N / AutoDAN
[ICLR 2024] The official implementation of our ICLR2024 paper "AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language M…
☆453 · Jan 22, 2025 · Updated last year
elliotaplant / role-based-rag
Demo of Role-Based Access Control in LLM Vector Databases
☆18 · Nov 27, 2023 · Updated 2 years ago
Les1a / SoftTokenForMaskedDLM
Introduces a continuous intermediate representation between "masks" and "tokens" for dLLM
☆15 · Dec 1, 2025 · Updated 7 months ago
PurduePAML / DBS
☆18 · Aug 15, 2022 · Updated 3 years ago
ybwang119 / label_recovery
[ICLR 2024] Towards Eliminating Hard Label Constraints in Gradient Inversion Attacks
☆14 · Feb 6, 2024 · Updated 2 years ago
wuxiyang1996 / AutoHallusion
AutoHallusion Codebase (EMNLP 2024)
☆23 · Dec 6, 2024 · Updated last year
BrachioLab / adversarial_prompting
☆53 · May 24, 2023 · Updated 3 years ago
Huang-yihao / Personalization-based_backdoor
☆12 · Dec 18, 2024 · Updated last year
kamata1729 / visualize-pytorch
PyTorch implementation of gradCAM, guidedBackProp, smoothGrad
☆13 · Mar 5, 2019 · Updated 7 years ago
sail-sg / I-FSJ
Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (NeurIPS 2024)
☆65 · Jan 11, 2025 · Updated last year
fangjf1 / OpenSafeMLRM
The first toolkit for MLRM safety evaluation, providing a unified interface for mainstream models, datasets, and jailbreaking methods!
☆15 · Apr 8, 2025 · Updated last year
Ai-trainee / o1-flow
Using Llama-3.1 70b on Groq to create o1-like reasoning chains
☆17 · Sep 22, 2024 · Updated last year
patrickrchao / JailbreakingLLMs
☆757 · Jul 2, 2025 · Updated last year
Ymm-cll / TrustAgent
☆99 · Mar 20, 2025 · Updated last year
erfanshayegani / Jailbreak-In-Pieces
[ICLR 2024 Spotlight 🔥] - [Best Paper Award SoCal NLP 2023 🏆] - Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal…
☆93 · Jun 6, 2024 · Updated 2 years ago
Jasper-Yan / SCRL
[ACL'26] Official Repository for The Paper: What If Consensus Lies? Selective-Complementary Reinforcement Learning at Test Time
☆15 · Apr 7, 2026 · Updated 3 months ago
SolidShen / RIPPLE_official
☆20 · Feb 11, 2024 · Updated 2 years ago
dgoulet / tor-parser
Tor consensus and server descriptor parser
☆14 · Nov 24, 2022 · Updated 3 years ago
ICL-ml4csec / SQIRL
☆21 · Jun 27, 2023 · Updated 3 years ago