sophie-xhonneux / Continuous-AdvTrain

☆32

Alternatives and similar repositories for Continuous-AdvTrain

Users who are interested in Continuous-AdvTrain are comparing it to the repositories listed below.

lapisrocks / rpo
Official repository for "Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks"
☆59 Updated last year
facebookresearch / jailbreak-objectives
Code and data to go with the Zhu et al. paper "An Objective for Nuanced LLM Jailbreaks"
☆35 Updated 10 months ago
git-disl / Vaccine
This is the official code for the paper "Vaccine: Perturbation-aware Alignment for Large Language Models" (NeurIPS2024)
☆47 Updated 11 months ago
YihanWang617 / llm-jailbreaking-defense
A lightweight library for large language model (LLM) jailbreaking defense.
☆58 Updated 2 months ago
AI45Lab / CodeAttack
[ACL 2024] CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion
☆54 Updated last month
swj0419 / muse_bench
☆29 Updated 8 months ago
wagner-group / MarkMyWords
☆32 Updated last year
papersPapers / BadPrompt
Code for the paper "BadPrompt: Backdoor Attacks on Continuous Prompts"
☆40 Updated last year
llm-editing / editing-attack
Code and dataset for the paper: "Can Editing LLMs Inject Harm?"
☆21 Updated last year
Unispac / shallow-vs-deep-alignment
Official Repository for The Paper: Safety Alignment Should Be Made More Than Just a Few Tokens Deep
☆163 Updated 6 months ago
Jayfeather1024 / Backdoor-Enhanced-Alignment
☆23 Updated 11 months ago
qingjiesjtu / USC
This is the code repository of our submission: Understanding the Dark Side of LLMs’ Intrinsic Self-Correction.
☆63 Updated 10 months ago
ethz-spylab / rlhf-poisoning
Code for paper "Universal Jailbreak Backdoors from Poisoned Human Feedback"
☆62 Updated last year
git-disl / Safety-Tax
This is the official code for the paper "Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable".
☆25 Updated 8 months ago
AISafety-HKUST / Backdoor_Safety_Tuning
Backdoor Safety Tuning (NeurIPS 2023 & 2024 Spotlight)
☆26 Updated 11 months ago
inspire-group / RobustRAG
☆21 Updated last year
wonderNefelibata / Awesome-LRM-Safety
Awesome Large Reasoning Model (LRM) Safety. This repository is used to collect security-related research on large reasoning models such as …
☆76 Updated this week
byerose / Awesome-Foundation-Model-Security
A curated list of trustworthy Generative AI papers. Daily updating...
☆75 Updated last year
rmin2000 / adv_tracing
Identification of the Adversary from a Single Adversarial Example (ICML 2023)
☆10 Updated last year
SORRY-Bench / sorry-bench
Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal" (ICLR 2025)
☆65 Updated 8 months ago
Vinsonzyh / BlueSuffix
[ICLR 2025] BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks
☆30 Updated 2 weeks ago
cnut1648 / Model-Fingerprint
Fingerprint large language models
☆44 Updated last year
ydyjya / SafetyHeadAttribution
☆53 Updated 5 months ago
thu-coai / JailbreakDefense_GoalPriority
[ACL 2024] Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization
☆29 Updated last year
lancopku / agent-backdoor-attacks
Code&Data for the paper "Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents" [NeurIPS 2024]
☆98 Updated last year
git-disl / Lisa
This is the official code for the paper "Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning" (NeurIPS2024)
☆24 Updated last year
SproutNan / AI-Safety_SCAV
This is the code repository for "Uncovering Safety Risks of Large Language Models through Concept Activation Vector"
☆46 Updated last month
centerforaisafety / tdc2023-starter-kit
This is the starter kit for the Trojan Detection Challenge 2023 (LLM Edition), a NeurIPS 2023 competition.
☆89 Updated last year
THU-BPM / Robust_Watermark
Code and data for paper "A Semantic Invariant Robust Watermark for Large Language Models" accepted by ICLR 2024.
☆34 Updated last year
Vaidehi99 / InfoDeletionAttacks
☆47 Updated 9 months ago