dapurv5 / awesome-red-teaming-llmsLinks

Papers from our SoK on Red-Teaming (Accepted at TMLR)

☆32

Alternatives and similar repositories for awesome-red-teaming-llms

Users that are interested in awesome-red-teaming-llms are comparing it to the libraries listed below

Sorting:

centerforaisafety / wmdp
WMDP is a LLM proxy benchmark for hazardous knowledge in bio, cyber, and chemical security. We also release code for RMU, an unlearning m…
☆151Updated 5 months ago
Yu-Fangxu / COLD-Attack
[ICML 2024] COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability
☆168Updated 10 months ago
JonasGeiping / carving
Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives
☆69Updated last year
briland / LLM-security-and-privacy
LLM security and privacy
☆51Updated last year
vinusankars / BEAST
Implementation of BEAST adversarial attack for language models (ICML 2024)
☆91Updated last year
microsoft / TaskTracker
TaskTracker is an approach to detecting task drift in Large Language Models (LLMs) by analysing their internal activations. It provides a…
☆74Updated 2 months ago
liu00222 / Open-Prompt-Injection
This repository provides a benchmark for prompt injection attacks and defenses
☆318Updated this week
Libr-AI / OpenRedTeaming
Papers about red teaming LLMs and Multimodal models.
☆145Updated 5 months ago
RapidResponseBench / rapidresponsebench
☆35Updated 11 months ago
facebookresearch / SecAlign
Repo for the research paper "SecAlign: Defending Against Prompt Injection with Preference Optimization"
☆72Updated 3 months ago
tml-epfl / llm-adaptive-attacks
Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks [ICLR 2025]
☆357Updated 9 months ago
tml-epfl / llm-past-tense
Does Refusal Training in LLMs Generalize to the Past Tense? [ICLR 2025]
☆75Updated 9 months ago
chawins / pal
PAL: Proxy-Guided Black-Box Attack on Large Language Models
☆55Updated last year
AIM-Intelligence / Automated-Multi-Turn-Jailbreaks
☆94Updated 11 months ago
max-andr / adversarial-random-search-gpt4
Adversarial Attacks on GPT-4 via Simple Random Search [Dec 2023]
☆43Updated last year
leondz / autoredteam
autoredteam: code for training models that automatically red team other language models
☆13Updated 2 years ago
CryptoAILab / JailbreakEval
[NDSS'25 Best Technical Poster] A collection of automated evaluators for assessing jailbreak attempts.
☆172Updated 7 months ago
lve-org / lve
A repository of Language Model Vulnerabilities and Exposures (LVEs).
☆112Updated last year
patrickrchao / JailbreakingLLMs
☆639Updated 4 months ago
AI-secure / AgentPoison
[NeurIPS 2024] Official implementation for "AgentPoison: Red-teaming LLM Agents via Memory or Knowledge Base Backdoor Poisoning"
☆162Updated 6 months ago
JailbreakBench / jailbreakbench
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Language Models [NeurIPS 2024 Datasets and Benchmarks Track]
☆446Updated 6 months ago
microsoft / BIPIA
A benchmark for evaluating the robustness of LLMs and defenses to indirect prompt injection attacks.
☆87Updated last year
GraySwanAI / circuit-breakers
Improving Alignment and Robustness with Circuit Breakers
☆238Updated last year
AI-secure / RedCode
[NeurIPS'24] RedCode: Risky Code Execution and Generation Benchmark for Code Agents
☆52Updated 3 months ago
HumanCompatibleAI / tensor-trust
A prompt injection game to collect data for robust ML research
☆65Updated 9 months ago
ethz-spylab / agentdojo
A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents.
☆340Updated this week
usail-hkust / JailTrickBench
Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs. Empirical tricks for LLM Jailbreaking. (NeurIPS 2024)
☆152Updated 11 months ago
rishub-tamirisa / tamper-resistance
[ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs"
☆62Updated 4 months ago
lancopku / agent-backdoor-attacks
Code&Data for the paper "Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents" [NeurIPS 2024]
☆94Updated last year
ryoungj / ToolEmu
[ICLR'24 Spotlight] A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use
☆167Updated last year