neelsjain / baseline-defenses
Official Code for "Baseline Defenses for Adversarial Attacks Against Aligned Language Models"
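For context, the paper studies simple defenses such as perplexity filtering, input paraphrasing, and retokenization against optimization-based jailbreaks (e.g., GCG). The snippet below is a minimal, illustrative sketch of a perplexity filter, not code from this repository; the GPT-2 reference model and the threshold value are assumptions that would need calibration on benign prompts.

```python
# Illustrative sketch only (not this repository's code): flag a prompt as
# suspicious when its perplexity under a small reference LM exceeds a
# threshold, since GCG-style adversarial suffixes tend to be high-perplexity.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

@torch.no_grad()
def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    # The LM loss is the mean per-token negative log-likelihood.
    loss = model(input_ids=ids, labels=ids).loss
    return torch.exp(loss).item()

def is_suspicious(prompt: str, threshold: float = 1000.0) -> bool:
    # The threshold is a placeholder; in practice it is calibrated on benign prompts.
    return perplexity(prompt) > threshold
```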
Related projects:
- Official implementation of AdvPrompter (https://arxiv.org/abs/2404.16873)
- A lightweight library for large language model (LLM) jailbreaking defense.
- Jailbreaking Large Vision-language Models via Typographic Visual Prompts
- Starter kit for the Trojan Detection Challenge 2023 (LLM Edition), a NeurIPS 2023 competition.
- An unofficial implementation of the AutoDAN attack on LLMs (arXiv:2310.15140)
- All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks
- SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors
- Accepted by ECCV 2024
- Official repository for "Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks"
- [NeurIPS 2023] Differentially Private Image Classification by Learning Priors from Random Processes
- Improved techniques for optimization-based jailbreaking on large language models
- A collection of automated evaluators for assessing jailbreak attempts.
- JailBreakV-28K: A comprehensive benchmark designed to evaluate the transferability of LLM jailbreak attacks to MLLMs, and further assess …
- The official implementation of our pre-print paper "Automatic and Universal Prompt Injection Attacks against Large Language Models".
- Official repository for ICML 2024 paper "On Prompt-Driven Safeguarding for Large Language Models"
- Official Code for Paper: Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
- [ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning
- An LLM can Fool Itself: A Prompt-Based Adversarial Attack (ICLR 2024)
- Code to generate NeuralExecs (prompt injection for LLMs)
- A fast + lightweight implementation of the GCG algorithm in PyTorch
- [ACL 2024] CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion
- TAP: An automated jailbreaking method for black-box LLMs
- Code for the paper "Shadowcast: Stealthy Data Poisoning Attacks Against Vision-Language Models"