Babelscape / ALERTLinks

Official repository for the paper "ALERT: A Comprehensive Benchmark for Assessing Large Language Models’ Safety through Red Teaming"

☆44

Alternatives and similar repositories for ALERT

Users that are interested in ALERT are comparing it to the libraries listed below

Sorting:

declare-lab / red-instruct
Codes and datasets of the paper Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment
☆103Updated last year
DAMO-NLP-SG / multilingual-safety-for-LLMs
[ICLR 2024]Data for "Multilingual Jailbreak Challenges in Large Language Models"
☆79Updated last year
Libr-AI / OpenRedTeaming
Papers about red teaming LLMs and Multimodal models.
☆131Updated 2 months ago
boyiwei / CoTaEval
[NeurIPS 2024 D&B] Evaluating Copyright Takedown Methods for Language Models
☆17Updated last year
Princeton-SysML / Jailbreak_LLM
☆178Updated last year
allenai / wildguard
Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
☆87Updated 8 months ago
uw-nsl / SafeDecoding
Official Repository for ACL 2024 Paper SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
☆140Updated last year
centerforaisafety / wmdp
WMDP is a LLM proxy benchmark for hazardous knowledge in bio, cyber, and chemical security. We also release code for RMU, an unlearning m…
☆132Updated 2 months ago
allenai / wildteaming
☆32Updated 11 months ago
ThuCCSLab / JailbreakEval
[NDSS'25 Best Technical Poster] A collection of automated evaluators for assessing jailbreak attempts.
☆165Updated 4 months ago
Yu-Fangxu / COLD-Attack
[ICML 2024] COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability
☆162Updated 7 months ago
SheltonLiu-N / Universal-Prompt-Injection
The official implementation of our pre-print paper "Automatic and Universal Prompt Injection Attacks against Large Language Models".
☆51Updated 9 months ago
vinid / safety-tuned-llamas
ICLR2024 Paper. Showing properties of safety tuning and exaggerated safety.
☆85Updated last year
azshue / AutoPoison
The official repository of the paper "On the Exploitability of Instruction Tuning".
☆64Updated last year
paul-rottger / xstest
Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models"
☆106Updated 5 months ago
ethz-spylab / unlearning-vs-safety
☆23Updated 10 months ago
JonasGeiping / carving
Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives
☆70Updated last year
XuandongZhao / weak-to-strong
[ICML 2025] Weak-to-Strong Jailbreaking on Large Language Models
☆84Updated 3 months ago
GraySwanAI / circuit-breakers
Improving Alignment and Robustness with Circuit Breakers
☆225Updated 10 months ago
thunlp / Advbench
Code and data of the EMNLP 2022 paper "Why Should Adversarial Perturbations be Imperceptible? Rethink the Research Paradigm in Adversaria…
☆53Updated 2 years ago
facebookresearch / advprompter
Official implementation of AdvPrompter https//arxiv.org/abs/2404.16873
☆160Updated last year
usail-hkust / JailTrickBench
Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs. Empirical tricks for LLM Jailbreaking. (NeurIPS 2024)
☆144Updated 8 months ago
LLM-Tuning-Safety / LLMs-Finetuning-Safety
We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20…
☆314Updated last year
amazon-science / controlling-llm-memorization
☆36Updated 2 years ago
tml-epfl / llm-adaptive-attacks
Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks [ICLR 2025]
☆324Updated 6 months ago
mlcommons / modelbench
Run safety benchmarks against AI models and view detailed reports showing how well they performed.
☆99Updated this week
Lordog / R-Judge
R-Judge: Benchmarking Safety Risk Awareness for LLM Agents (EMNLP Findings 2024)
☆84Updated 2 months ago
GodXuxilie / PromptAttack
An LLM can Fool Itself: A Prompt-Based Adversarial Attack (ICLR 2024)
☆95Updated 6 months ago
poloclub / llm-landscape
NeurIPS'24 - LLM Safety Landscape
☆25Updated 5 months ago
OpenSafetyLab / SALAD-BENCH
【ACL 2024】 SALAD benchmark & MD-Judge
☆156Updated 5 months ago