dsbowen / strong_rejectLinks

☆102

Alternatives and similar repositories for strong_reject

Users that are interested in strong_reject are comparing it to the libraries listed below

Sorting:

facebookresearch / advprompter
Official implementation of AdvPrompter https//arxiv.org/abs/2404.16873
☆169Updated last year
GraySwanAI / circuit-breakers
Improving Alignment and Robustness with Circuit Breakers
☆238Updated last year
centerforaisafety / wmdp
WMDP is a LLM proxy benchmark for hazardous knowledge in bio, cyber, and chemical security. We also release code for RMU, an unlearning m…
☆151Updated 5 months ago
Princeton-SysML / Jailbreak_LLM
☆187Updated last year
Yu-Fangxu / COLD-Attack
[ICML 2024] COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability
☆166Updated 10 months ago
ethz-spylab / rlhf_trojan_competition
Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024.
☆115Updated last year
aengusl / latent-adversarial-training
☆43Updated last year
XuandongZhao / weak-to-strong
[ICML 2025] Weak-to-Strong Jailbreaking on Large Language Models
☆87Updated 5 months ago
OSU-NLP-Group / AmpleGCG
AmpleGCG: Learning a Universal and Transferable Generator of Adversarial Attacks on Both Open and Closed LLM
☆75Updated 11 months ago
GraySwanAI / nanoGCG
A fast + lightweight implementation of the GCG algorithm in PyTorch
☆293Updated 5 months ago
uw-nsl / SafeDecoding
Official Repository for ACL 2024 Paper SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
☆146Updated last year
centerforaisafety / tdc2023-starter-kit
This is the starter kit for the Trojan Detection Challenge 2023 (LLM Edition), a NeurIPS 2023 competition.
☆89Updated last year
paul-rottger / xstest
Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models"
☆116Updated 8 months ago
SheltonLiu-N / Universal-Prompt-Injection
The official implementation of our pre-print paper "Automatic and Universal Prompt Injection Attacks against Large Language Models".
☆60Updated last year
tml-epfl / llm-adaptive-attacks
Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks [ICLR 2025]
☆357Updated 9 months ago
JonasGeiping / carving
Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives
☆69Updated last year
vinid / safety-tuned-llamas
ICLR2024 Paper. Showing properties of safety tuning and exaggerated safety.
☆87Updated last year
AI-secure / DecodingTrust
A Comprehensive Assessment of Trustworthiness in GPT Models
☆306Updated last year
CryptoAILab / JailbreakEval
[NDSS'25 Best Technical Poster] A collection of automated evaluators for assessing jailbreak attempts.
☆172Updated 6 months ago
rotaryhammer / code-autodan
An unofficial implementation of AutoDAN attack on LLMs (arXiv:2310.15140)
☆44Updated last year
rishub-tamirisa / tamper-resistance
[ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs"
☆62Updated 4 months ago
andyrdt / refusal_direction
Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".
☆292Updated 4 months ago
sail-sg / I-FSJ
Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (NeurIPS 2024)
☆65Updated 9 months ago
SORRY-Bench / sorry-bench
Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal" (ICLR 2025)
☆64Updated 7 months ago
Libr-AI / OpenRedTeaming
Papers about red teaming LLMs and Multimodal models.
☆145Updated 5 months ago
chawins / pal
PAL: Proxy-Guided Black-Box Attack on Large Language Models
☆55Updated last year
allenai / wildteaming
☆35Updated last year
LLM-Tuning-Safety / LLMs-Finetuning-Safety
We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20…
☆328Updated last year
alexandrasouly / strongreject
Repository for "StrongREJECT for Empty Jailbreaks" paper
☆144Updated 11 months ago
Confirm-Solutions / flrt
Fluent student-teacher redteaming
☆23Updated last year