alexandrasouly/strongreject

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/alexandrasouly/strongreject)

alexandrasouly / strongreject

Repository for "StrongREJECT for Empty Jailbreaks" paper

☆157

Alternatives and similar repositories for strongreject

Users that are interested in strongreject are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

dsbowen / strong_reject
View on GitHub
☆146Jul 7, 2025Updated last year
uw-nsl / SafeDecoding
View on GitHub
Official Repository for ACL 2024 Paper SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
☆154Jul 19, 2024Updated 2 years ago
andyzoujm / breaking-llama-guard
View on GitHub
Code to break Llama Guard
☆32Dec 7, 2023Updated 2 years ago
paul-rottger / xstest
View on GitHub
Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models"
☆141Feb 24, 2025Updated last year
centerforaisafety / HarmBench
View on GitHub
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
☆1,017Aug 16, 2024Updated last year
Deploy on Railway without the complexity - Free Credits Offer • Ad
Connect your repo and Railway handles the rest with instant previews. Quickly provision container image services, databases, and storage volumes.
JailbreakBench / jailbreakbench
View on GitHub
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Language Models [NeurIPS 2024 Datasets and Benchmarks Track]
☆638Apr 4, 2025Updated last year
NY1024 / Jailbreak_GPT4o
View on GitHub
☆28Jun 5, 2024Updated 2 years ago
McGill-NLP / AdversarialTriggers
View on GitHub
TACL 2025: Investigating Adversarial Trigger Transfer in Large Language Models
☆19Aug 17, 2025Updated 11 months ago
LLM-Tuning-Safety / LLMs-Finetuning-Safety
View on GitHub
We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20…
☆358Feb 23, 2024Updated 2 years ago
tml-epfl / llm-adaptive-attacks
View on GitHub
Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks [ICLR 2025]
☆391Jan 23, 2025Updated last year
vinid / safety-tuned-llamas
View on GitHub
ICLR2024 Paper. Showing properties of safety tuning and exaggerated safety.
☆95May 9, 2024Updated 2 years ago
patrickrchao / JailbreakingLLMs
View on GitHub
☆757Jul 2, 2025Updated last year
SheltonLiu-N / AutoDAN
View on GitHub
[ICLR 2024] The official implementation of our ICLR2024 paper "AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language M…
☆453Jan 22, 2025Updated last year
GraySwanAI / nanoGCG
View on GitHub
A fast + lightweight implementation of the GCG algorithm in PyTorch
☆344May 13, 2025Updated last year
Wordpress hosting with auto-scaling - Free Trial Offer • Ad
Fully Managed hosting for WordPress and WooCommerce businesses that need reliable, auto-scalable performance. Cloudways SafeUpdates now available.
RICommunity / TAP
View on GitHub
TAP: An automated jailbreaking method for black-box LLMs
☆241Dec 10, 2024Updated last year
CryptoAILab / misalignment
View on GitHub
[NDSS'25] The official implementation of safety misalignment.
☆19Jan 8, 2025Updated last year
xirui-li / DrAttack
View on GitHub
Official implementation of paper: DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers
☆68Aug 25, 2024Updated last year
facebookresearch / multimodal-fusion-jailbreaks
View on GitHub
Official repository for the paper "Gradient-based Jailbreak Images for Multimodal Fusion Models" (https//arxiv.org/abs/2410.03489)
☆20Oct 22, 2024Updated last year
SORRY-Bench / sorry-bench
View on GitHub
Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal" (ICLR 2025)
☆83Mar 1, 2025Updated last year
chuhac / Reasoning-to-Defend
View on GitHub
[EMNLP 2025] Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking
☆12Aug 22, 2025Updated 11 months ago
kevinyaobytedance / llm_eval
View on GitHub
LLM evaluation.
☆16Nov 7, 2023Updated 2 years ago
JonasGeiping / carving
View on GitHub
Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives
☆71Feb 22, 2024Updated 2 years ago
rgreenblatt / model_organism_public
View on GitHub
☆15Jun 17, 2025Updated last year
Virtual machines for every use case on DigitalOcean • Ad
Get dependable uptime with 99.99% SLA, simple security tools, and predictable monthly pricing with DigitalOcean's virtual machines, called Droplets.
AI45Lab / ActorAttack
View on GitHub
☆134Jun 29, 2026Updated last month
yjw1029 / Self-Reminder
View on GitHub
Code for our paper "Defending ChatGPT against Jailbreak Attack via Self-Reminder" in NMI.
☆57Nov 13, 2023Updated 2 years ago
centerforaisafety / tdc2023-starter-kit
View on GitHub
This is the starter kit for the Trojan Detection Challenge 2023 (LLM Edition), a NeurIPS 2023 competition.
☆92May 19, 2024Updated 2 years ago
facebookresearch / advprompter
View on GitHub
Official implementation of AdvPrompter https//arxiv.org/abs/2404.16873
☆183May 6, 2024Updated 2 years ago
AI45Lab / CodeAttack
View on GitHub
[ACL 2024] CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion
☆62Oct 1, 2025Updated 9 months ago
KaidiXu / Beta-CROWN
View on GitHub
β-CROWN: Efficient Bound Propagation with Per-neuron Split Constraints for Neural Network Verification
☆31Nov 9, 2021Updated 4 years ago
SchwinnL / circuit-breakers-eval
View on GitHub
Independent robustness evaluation of Improving Alignment and Robustness with Short Circuiting
☆18Apr 15, 2025Updated last year
tml-epfl / llm-past-tense
View on GitHub
Does Refusal Training in LLMs Generalize to the Past Tense? [ICLR 2025]
☆78Jan 23, 2025Updated last year
thu-coai / JailbreakDefense_GoalPriority
View on GitHub
[ACL 2024] Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization
☆29Jul 9, 2024Updated 2 years ago
Managed hosting for WordPress and PHP on Cloudways • Ad
Managed hosting for WordPress, Magento, Laravel, or PHP apps, on multiple cloud providers. Deploy in minutes on Cloudways by DigitalOcean.
EleutherAI / elk-generalization
View on GitHub
Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e…
☆33May 23, 2024Updated 2 years ago
ethz-spylab / rlhf_trojan_competition
View on GitHub
Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024.
☆119Jun 13, 2024Updated 2 years ago
NJUNLP / ReNeLLM
View on GitHub
The official implementation of our NAACL 2024 paper "A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Lang…
☆163Sep 2, 2025Updated 10 months ago
huizhang-L / CodeChameleon
View on GitHub
☆30Mar 20, 2024Updated 2 years ago
locuslab / acr-memorization
View on GitHub
☆41Dec 19, 2024Updated last year
AIM-Intelligence / Automated-Multi-Turn-Jailbreaks
View on GitHub
☆139Dec 3, 2025Updated 7 months ago
OpenSafetyLab / SALAD-BENCH
View on GitHub
【ACL 2024】 SALAD benchmark & MD-Judge
☆176Mar 8, 2025Updated last year