TUD-ARTS-2023 / LLM-red-teaming-prompts
LLM red teaming datasets from the paper 'Student-Teacher Prompting for Red Teaming to Improve Guardrails' for the ART of Safety Workshop 2023
☆14 · Updated last year
Alternatives and similar repositories for LLM-red-teaming-prompts
Users interested in LLM-red-teaming-prompts are comparing it to the repositories listed below.
- Run safety benchmarks against AI models and view detailed reports showing how well they performed. ☆104 · Updated last week
- Official repository for the paper "ALERT: A Comprehensive Benchmark for Assessing Large Language Models’ Safety through Red Teaming" ☆46 · Updated last year
- S-Eval: Towards Automated and Comprehensive Safety Evaluation for Large Language Models ☆96 · Updated 2 months ago
- The open-source repository of FuzzLLM ☆29 · Updated last year
- Test LLMs against jailbreaks and unprecedented harms ☆35 · Updated 11 months ago
- Codes and datasets of the paper Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment ☆105 · Updated last year
- ☆34 · Updated last year
- CyberMetric dataset ☆103 · Updated 8 months ago
- ☆46 · Updated last year
- Agent Security Bench (ASB) ☆121 · Updated 3 months ago
- Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs ☆289 · Updated last year
- ☆46 · Updated last year
- R-Judge: Benchmarking Safety Risk Awareness for LLM Agents (EMNLP Findings 2024) ☆88 · Updated 4 months ago
- Synthesizing realistic and diverse text-datasets from augmented LLMs ☆15 · Updated 5 months ago
- ☆14 · Updated last year
- [NeurIPS 2024] Official implementation for "AgentPoison: Red-teaming LLM Agents via Memory or Knowledge Base Backdoor Poisoning" ☆155 · Updated 5 months ago
- [ICLR 2024] Data for "Multilingual Jailbreak Challenges in Large Language Models" ☆83 · Updated last year
- Papers about red teaming LLMs and Multimodal models. ☆140 · Updated 3 months ago
- PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to a… ☆424 · Updated last year
- Code and data of the EMNLP 2022 paper "Why Should Adversarial Perturbations be Imperceptible? Rethink the Research Paradigm in Adversaria… ☆56 · Updated 2 years ago
- The official implementation of our pre-print paper "Automatic and Universal Prompt Injection Attacks against Large Language Models". ☆58 · Updated 11 months ago
- ☆13 · Updated 7 months ago
- An LLM can Fool Itself: A Prompt-Based Adversarial Attack (ICLR 2024) ☆99 · Updated 8 months ago
- Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique ☆18 · Updated last year
- [NeurIPS 2021] "Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models" by Boxin Wang*, Chejian Xu*, Shuoh… ☆12 · Updated 2 years ago
- Repository for the paper "RTP-LX: Can LLMs Evaluate Toxicity in Multilingual Scenarios?" ☆24 · Updated 4 months ago
- [ACL 2025] LongSafety: Evaluating Long-Context Safety of Large Language Models ☆14 · Updated 3 months ago
- [ICML 2025] Official repository for paper "OR-Bench: An Over-Refusal Benchmark for Large Language Models" ☆15 · Updated 6 months ago
- A Comprehensive Assessment of Trustworthiness in GPT Models ☆303 · Updated last year
- Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs ☆91 · Updated 9 months ago