rain152 / PAT · Links
[NeurIPS 2024] Fight Back Against Jailbreaking via Prompt Adversarial Tuning
☆10 · Updated last year
Alternatives and similar repositories for PAT
Users interested in PAT are comparing it to the libraries listed below.
- Code for NeurIPS 2024 Paper "Fight Back Against Jailbreaking via Prompt Adversarial Tuning" ☆19 · Updated 5 months ago
- [ICLR 2024 Spotlight 🔥] [Best Paper Award SoCal NLP 2023 🏆] Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal… ☆73 · Updated last year
- ☆63 · Updated 7 months ago
- A package that achieves 95%+ transfer attack success rate against GPT-4 ☆23 · Updated last year
- [ACL 2024] CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion ☆54 · Updated 3 weeks ago
- Code for the NeurIPS 2024 paper "Shadowcast: Stealthy Data Poisoning Attacks Against Vision-Language Models" ☆56 · Updated 9 months ago
- ☆53 · Updated last year
- Code repo of our paper "Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis" (https://arxiv.org/abs/2406.10794) ☆22 · Updated last year
- Official Code for "Baseline Defenses for Adversarial Attacks Against Aligned Language Models" ☆28 · Updated 2 years ago
- To Think or Not to Think: Exploring the Unthinking Vulnerability in Large Reasoning Models ☆32 · Updated 5 months ago
- [ECCV'24 Oral] The official GitHub page for "Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking …" ☆30 · Updated last year
- [ACL 2024] Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization ☆29 · Updated last year
- [ICLR 2024] Towards Eliminating Hard Label Constraints in Gradient Inversion Attacks ☆13 · Updated last year
- Comprehensive Assessment of Trustworthiness in Multimodal Foundation Models ☆22 · Updated 7 months ago
- ☆105 · Updated last year
- Accepted by ECCV 2024 ☆169 · Updated last year
- The first toolkit for MLRM safety evaluation, providing a unified interface for mainstream models, datasets, and jailbreaking methods! ☆13 · Updated 6 months ago
- ☆48 · Updated last year
- ☆23 · Updated 10 months ago
- Awesome Large Reasoning Model (LRM) Safety. This repository collects security-related research on large reasoning models such as … ☆76 · Updated this week
- Repository for the paper "Refusing Safe Prompts for Multi-modal Large Language Models" ☆18 · Updated last year
- ☆38 · Updated last year
- This is the code repository for "Uncovering Safety Risks of Large Language Models through Concept Activation Vector" ☆46 · Updated 2 weeks ago
- This repo covers the safety topic, including attacks, defenses, and studies related to reasoning and RL ☆48 · Updated last month
- GitHub repo for the NeurIPS 2024 paper "Safe LoRA: the Silver Lining of Reducing Safety Risks when Fine-tuning Large Language Models" ☆21 · Updated last month
- This is the code repository of our submission "Understanding the Dark Side of LLMs' Intrinsic Self-Correction" ☆63 · Updated 10 months ago
- ☆52 · Updated 10 months ago
- ☆22 · Updated 7 months ago
- Official codebase for "STAIR: Improving Safety Alignment with Introspective Reasoning" ☆77 · Updated 8 months ago
- Code for the ICLR 2025 paper "Failures to Find Transferable Image Jailbreaks Between Vision-Language Models" ☆32 · Updated 4 months ago