yjw1029 / Self-Reminder
View external linksLinks

Code for our paper "Defending ChatGPT against Jailbreak Attack via Self-Reminder" in NMI.

☆56

Alternatives and similar repositories for Self-Reminder

Users that are interested in Self-Reminder are comparing it to the libraries listed below

Sorting:

yjw1029 / Self-Reminder-Data
View on GitHub
Data for our paper "Defending ChatGPT against Jailbreak Attack via Self-Reminder"
☆20Oct 26, 2023Updated 2 years ago
W-Ted / UDC-NeRF
View on GitHub
Official code for ICCV2023 paper: Learning Unified Decompositional and Compositional NeRF for Editable Novel View Synthesis
☆34Dec 27, 2023Updated 2 years ago
lapisrocks / rpo
View on GitHub
Official repository for "Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks"
☆61Aug 8, 2024Updated last year
thu-coai / JailbreakDefense_GoalPriority
View on GitHub
[ACL 2024] Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization
☆29Jul 9, 2024Updated last year
pipilurj / ROBOT
View on GitHub
☆27Apr 11, 2023Updated 2 years ago
rt219 / LatentGuard
View on GitHub
This is the official repo of the paper "Latent Guard: a Safety Framework for Text-to-image Generation"
☆52Oct 24, 2024Updated last year
arobey1 / smooth-llm
View on GitHub
☆122Nov 13, 2023Updated 2 years ago
pipilurj / MLLM-protector
View on GitHub
The official repository for paper "MLLM-Protector: Ensuring MLLM’s Safety without Hurting Performance"
☆44Apr 21, 2024Updated last year
pipilurj / bootstrapped-preference-optimization-BPO
View on GitHub
code for "Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization"
☆60Aug 23, 2024Updated last year
SumilerGAO / SunGen
View on GitHub
☆27Feb 26, 2023Updated 2 years ago
xyq7 / FedREDefense
View on GitHub
The official code for ICML 2024 "FedREDefense: Defending against Model Poisoning Attacks for Federated Learning using Model Update Recons…
☆29Jun 6, 2024Updated last year
xirui-li / DrAttack
View on GitHub
Official implementation of paper: DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers
☆65Aug 25, 2024Updated last year
SheltonLiu-N / AutoDAN
View on GitHub
[ICLR 2024] The official implementation of our ICLR2024 paper "AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language M…
☆427Jan 22, 2025Updated last year
Princeton-SysML / Jailbreak_LLM
View on GitHub
☆193Nov 26, 2023Updated 2 years ago
kriti-hippo / red_queen
View on GitHub
Red Queen Dataset and data generation template
☆25Dec 26, 2025Updated last month
Aatrox103 / SAP
View on GitHub
☆48May 9, 2024Updated last year
tmlr-group / DeepInception
View on GitHub
[arXiv:2311.03191] "DeepInception: Hypnotize Large Language Model to Be Jailbreaker"
☆173Feb 20, 2024Updated last year
shiningrain / JailGuard
View on GitHub
☆25Mar 16, 2025Updated 10 months ago
chujiezheng / LLM-Safeguard
View on GitHub
Official repository for ICML 2024 paper "On Prompt-Driven Safeguarding for Large Language Models"
☆106May 20, 2025Updated 8 months ago
NLie2 / what_features_jailbreak_LLMs
View on GitHub
☆18Mar 30, 2025Updated 10 months ago
Yu-Fangxu / COLD-Attack
View on GitHub
[ICML 2024] COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability
☆176Dec 18, 2024Updated last year
LLM-Tuning-Safety / LLMs-Finetuning-Safety
View on GitHub
We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20…
☆338Feb 23, 2024Updated last year
YihanWang617 / llm-jailbreaking-defense
View on GitHub
A lightweight library for large laguage model (LLM) jailbreaking defense.
☆61Sep 11, 2025Updated 5 months ago
Allen-piexl / JailbreakZoo
View on GitHub
☆164Sep 2, 2024Updated last year
poloclub / llm-self-defense
View on GitHub
LLM Self Defense: By Self Examination, LLMs know they are being tricked
☆48May 21, 2024Updated last year
XuandongZhao / weak-to-strong
View on GitHub
[ICML 2025] Weak-to-Strong Jailbreaking on Large Language Models
☆92May 2, 2025Updated 9 months ago
pipilurj / DynaFed
View on GitHub
☆50Apr 1, 2023Updated 2 years ago
llm-editing / editing-attack
View on GitHub
Code and dataset for the paper: "Can Editing LLMs Inject Harm?"
☆21Dec 26, 2025Updated last month
aounon / certified-llm-safety
View on GitHub
☆51Aug 10, 2024Updated last year
SafeAILab / RAIN
View on GitHub
[ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning
☆98May 23, 2024Updated last year
jbaron34 / torchwindow
View on GitHub
Display tensors directly from GPU
☆11Oct 12, 2025Updated 4 months ago
RylanSchaeffer / AstraFellowship-When-Do-VLM-Image-Jailbreaks-Transfer
View on GitHub
Code for ICLR 2025 Failures to Find Transferable Image Jailbreaks Between Vision-Language Models
☆37Jun 1, 2025Updated 8 months ago
pipilurj / G-LLaVA
View on GitHub
Official github repo of G-LLaVA
☆148Feb 20, 2025Updated 11 months ago
rotaryhammer / code-autodan
View on GitHub
An unofficial implementation of AutoDAN attack on LLMs (arXiv:2310.15140)
☆45Feb 8, 2024Updated 2 years ago
stevejaehyeok / MoCo-NeRF
View on GitHub
☆11Jul 17, 2024Updated last year
nvedant07 / Fairness-Through-Robustness
View on GitHub
Official code for FAccT'21 paper "Fairness Through Robustness: Investigating Robustness Disparity in Deep Learning" https://arxiv.org/abs…
☆13Mar 9, 2021Updated 4 years ago
declare-lab / resta
View on GitHub
Restore safety in fine-tuned language models through task arithmetic
☆32Mar 28, 2024Updated last year
Unispac / shallow-vs-deep-alignment
View on GitHub
Official Repository for The Paper: Safety Alignment Should Be Made More Than Just a Few Tokens Deep
☆173Apr 23, 2025Updated 9 months ago
CryptoAILab / FigStep
View on GitHub
[AAAI'25 (Oral)] Jailbreaking Large Vision-language Models via Typographic Visual Prompts
☆191Jun 26, 2025Updated 7 months ago

yjw1029 / Self-ReminderView external linksLinks

Alternatives and similar repositories for Self-Reminder

yjw1029 / Self-Reminder
View external linksLinks