yjw1029 / Self-Reminder-Data
View external linksLinks

Data for our paper "Defending ChatGPT against Jailbreak Attack via Self-Reminder"

☆20

Alternatives and similar repositories for Self-Reminder-Data

Users that are interested in Self-Reminder-Data are comparing it to the libraries listed below

Sorting:

yjw1029 / Self-Reminder
View on GitHub
Code for our paper "Defending ChatGPT against Jailbreak Attack via Self-Reminder" in NMI.
☆56Nov 13, 2023Updated 2 years ago
yjw1029 / UA-FedRec
View on GitHub
The python implementation of our "UA-FedRec: Untargeted Attack on Federated News Recommendation" in KDD 2023.
☆19Aug 2, 2022Updated 3 years ago
aounon / certified-llm-safety
View on GitHub
☆51Aug 10, 2024Updated last year
UCSB-NLP-Chang / Prereq_tune
View on GitHub
Implementation for the paper "Fictitious Synthetic Data Can Improve LLM Factuality via Prerequisite Learning"
☆11Jan 10, 2025Updated last year
real-absolute-AI / Unnatural_Language
View on GitHub
The official repository of 'Unnatural Language Are Not Bugs but Features for LLMs'
☆24May 20, 2025Updated 8 months ago
shiningrain / JailGuard
View on GitHub
☆25Mar 16, 2025Updated 10 months ago
PorUna-byte / PAR
View on GitHub
☆20Aug 30, 2025Updated 5 months ago
llm-editing / editing-attack
View on GitHub
Code and dataset for the paper: "Can Editing LLMs Inject Harm?"
☆21Dec 26, 2025Updated last month
huanranchen / VLMTransfer
View on GitHub
A package that achieves 95%+ transfer attack success rate against GPT-4
☆26Oct 24, 2024Updated last year
lapisrocks / rpo
View on GitHub
Official repository for "Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks"
☆61Aug 8, 2024Updated last year
SheltonLiu-N / Universal-Prompt-Injection
View on GitHub
The official implementation of our pre-print paper "Automatic and Universal Prompt Injection Attacks against Large Language Models".
☆68Oct 23, 2024Updated last year
vedantpalit / Towards-Vision-Language-Mechanistic-Interpretability
View on GitHub
This is the official repository for the "Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP" paper acce…
☆25Apr 18, 2024Updated last year
RockyHHH / Safety-Evaluating
View on GitHub
本文提出了一个基于“文心一言”的中国LLMs的安全评估基准，其中包括8种典型的安全场景和6种指令攻击类型。此外，本文还提出了安全评估的框架和过程，利用手动编写和收集开源数据的测试Prompts，以及人工干预结合利用LLM强大的评估能力作为“共同评估者”。
☆32Sep 1, 2023Updated 2 years ago
Princeton-SysML / Jailbreak_LLM
View on GitHub
☆193Nov 26, 2023Updated 2 years ago
zhaoyiran924 / Probe-Sampling
View on GitHub
[NeurIPS 2024] Accelerating Greedy Coordinate Gradient and General Prompt Optimization via Probe Sampling
☆33Nov 8, 2024Updated last year
thu-coai / JailbreakDefense_GoalPriority
View on GitHub
[ACL 2024] Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization
☆29Jul 9, 2024Updated last year
ShuchiWu / RDA
View on GitHub
☆11Dec 23, 2024Updated last year
ZhentingWang / DUMP
View on GitHub
☆34May 9, 2025Updated 9 months ago
uw-nsl / SafeDecoding
View on GitHub
Official Repository for ACL 2024 Paper SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
☆151Jul 19, 2024Updated last year
WYuan1001 / AdaVD
View on GitHub
[CVPR2025] Precise, Fast, and Low-cost Concept Erasure in Value Space: Orthogonal Complement Matters
☆43Mar 11, 2025Updated 11 months ago
poloclub / llm-self-defense
View on GitHub
LLM Self Defense: By Self Examination, LLMs know they are being tricked
☆48May 21, 2024Updated last year
Wenjun-Peng / GPT4SM
View on GitHub
☆11Jun 7, 2023Updated 2 years ago
nirgreshler / bayesian-online-planning
View on GitHub
The code for the paper "A Bayesian Approach to Online Planning" published in ICML 2024.
☆13Jun 17, 2024Updated last year
mint-vu / Brainwash
View on GitHub
BrainWash: A Poisoning Attack to Forget in Continual Learning
☆12Apr 15, 2024Updated last year
CLEANit / heatenginegym
View on GitHub
A collection of heat engines, based on the OpenAI Gym environment framework for use with reinforcement learning applications.
☆15Dec 20, 2021Updated 4 years ago
inboxedshoe / RP-DQN
View on GitHub
☆11Jan 11, 2022Updated 4 years ago
wassname / rl_2d_walker.js
View on GitHub
Teaching a humanoid to walk(ish), then displaying in your browser (using tensorflow.js and reinforcement learning)
☆10Sep 7, 2020Updated 5 years ago
vint-1 / dreamsmooth
View on GitHub
DreamSmooth: Improving Model-Based RL with Reward Smoothing (ICLR 2024)
☆12May 6, 2024Updated last year
ReidarRiveland / Instruct-RNN
View on GitHub
☆14Mar 21, 2024Updated last year
harish-kamath / rqae
View on GitHub
Residual Quantization Autoencoder, used for interpreting LLMs
☆14Jan 1, 2025Updated last year
fpdetective / modCrawler
View on GitHub
Crawler based on a modified browser to detect online tracking.
☆11Jul 19, 2023Updated 2 years ago
strickvl / awesome-balochi-nlp
View on GitHub
A repository for resources relating to NLP in the Balochi language
☆19Jun 3, 2023Updated 2 years ago
Improbable-AI / orso
View on GitHub
☆16Feb 22, 2025Updated 11 months ago
holken / polite
View on GitHub
code for polite
☆11Feb 28, 2024Updated last year
MadryLab / D3M
View on GitHub
Debiasing Through Data Attribution
☆12May 23, 2024Updated last year
OuAzusaKou / imagination_mechanism
View on GitHub
About Code release for "Imagination Mechanism: Mesh Information Propagation for Enhancing Data Efficiency in Reinforcement Learning"
☆13Oct 7, 2023Updated 2 years ago
giladcohen / NNIF_adv_defense
View on GitHub
Detection of adversarial examples using influence functions and nearest neighbors
☆37Nov 22, 2022Updated 3 years ago
Victorwz / LaViA
View on GitHub
☆10Jul 13, 2024Updated last year
ZZZhr-1 / Robust_GUI_Grounding
View on GitHub
On the Robustness of GUI Grounding Models Against Image Attacks
☆12Apr 8, 2025Updated 10 months ago

yjw1029 / Self-Reminder-DataView external linksLinks

Alternatives and similar repositories for Self-Reminder-Data

yjw1029 / Self-Reminder-Data
View external linksLinks