This repo contains the code for generating the ToxiGen dataset, published at ACL 2022.
☆344 · Jun 17, 2024 · Updated last year
Alternatives and similar repositories for TOXIGEN
Users who are interested in TOXIGEN are comparing it to the repositories listed below.
- Repository for the Dynamically Generated Hate Speech Dataset by Vidgen et al. (2021). ☆46 · May 26, 2025 · Updated 9 months ago
- ☆44 · Jun 29, 2023 · Updated 2 years ago
- Dataset associated with "BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation" paper ☆87 · Mar 2, 2021 · Updated 5 years ago
- Röttger et al. (ACL 2021): "HateCheck: Functional Tests for Hate Speech Detection Models" - Data ☆59 · Oct 14, 2025 · Updated 5 months ago
- ☆230 · Feb 23, 2021 · Updated 5 years ago
- We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20… ☆345 · Feb 23, 2024 · Updated 2 years ago
- Code and data for the EMNLP 2021 paper "Just Say No: Analyzing the Stance of Neural Dialogue Generation in Offensive Contexts". Coming so… ☆17 · Jul 27, 2023 · Updated 2 years ago
- Trained models & code to predict toxic comments on all 3 Jigsaw Toxic Comment Challenges. Built using ⚡ Pytorch Lightning and 🤗 Transfor… ☆1,202 · Jan 5, 2026 · Updated 2 months ago
- Code for our EACL 2021 paper: "Challenges in Automated Debiasing for Toxic Language Detection" by Xuhui Zhou, Maarten Sap, Swabha Swayamd… ☆19 · Aug 20, 2021 · Updated 4 years ago
- ☆28 · Feb 27, 2025 · Updated last year
- Official repository of "HARE: Explainable Hate Speech Detection with Step-by-Step Reasoning", Findings of EMNLP 2023 ☆28 · Jan 25, 2024 · Updated 2 years ago
- Human preference data for "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback" ☆1,827 · Jun 17, 2025 · Updated 9 months ago
- Generalizable Implicit Hate Speech Detection using Contrastive Learning (COLING 2022) ☆14 · Oct 9, 2022 · Updated 3 years ago
- ☆12 · Oct 23, 2022 · Updated 3 years ago
- Can we use explanations to improve hate speech models? Our paper accepted at AAAI 2021 tries to explore that question. ☆236 · Jun 12, 2023 · Updated 2 years ago
- Chinese safety prompts for evaluating and improving the safety of LLMs. ☆1,136 · Feb 27, 2024 · Updated 2 years ago
- Find and fix bugs in natural language machine learning models using adaptive testing. ☆188 · May 7, 2024 · Updated last year
- "Stones from other hills may serve to polish jade": Fudan Baize Intelligence (复旦白泽智能) releases JADE-DB, a demo dataset targeting Chinese open-source and overseas commercial large models ☆500 · Nov 18, 2025 · Updated 4 months ago
- IPython notebook with synthetic experiments for AFLite, based on the ICML 2020 paper, "Adversarial Filters of Dataset Biases". ☆16 · Aug 14, 2020 · Updated 5 years ago
- This repository contains the data and code introduced in the paper "CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Maske… ☆130 · Mar 1, 2024 · Updated 2 years ago
- Hate speech detection corpus in Korean, shared with EMNLP 2023 paper ☆17 · Apr 19, 2024 · Updated last year
- Repository for the Bias Benchmark for QA dataset. ☆139 · Jan 8, 2024 · Updated 2 years ago
- Official repository for the paper "Gradient-based Jailbreak Images for Multimodal Fusion Models" (https://arxiv.org/abs/2410.03489) ☆19 · Oct 22, 2024 · Updated last year
- NSMC, KorSTS ... fine-tunings ☆18 · Feb 23, 2022 · Updated 4 years ago
- Python project template for personal projects! 🙋‍♀️ ☆11 · Nov 28, 2020 · Updated 5 years ago
- TruthfulQA: Measuring How Models Imitate Human Falsehoods ☆891 · Jan 16, 2025 · Updated last year
- ☆27 · Nov 20, 2023 · Updated 2 years ago
- Fortifying Toxic Speech Detectors Against Veiled Toxicity ☆11 · Oct 21, 2020 · Updated 5 years ago
- A Comprehensive Assessment of Trustworthiness in GPT Models ☆314 · Sep 16, 2024 · Updated last year
- Using GPT-3 to detect hate speech that contains sexist and racist content ☆24 · Nov 11, 2025 · Updated 4 months ago
- ☆14 · Jan 6, 2025 · Updated last year
- Codes and datasets of the paper Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment ☆108 · Mar 8, 2024 · Updated 2 years ago
- Official repo for GPTFUZZER : Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts ☆573 · Feb 27, 2026 · Updated 3 weeks ago
- An original implementation of the paper "CREPE: Open-Domain Question Answering with False Presuppositions" ☆16 · Nov 5, 2024 · Updated last year
- Repository for the Paper (AAAI 2024, Oral) --- Visual Adversarial Examples Jailbreak Large Language Models ☆269 · May 13, 2024 · Updated last year
- Research on evaluating and aligning the values of Chinese large language models ☆555 · Jul 20, 2023 · Updated 2 years ago
- A re-implementation of the "Extracting Training Data from Large Language Models" paper by Carlini et al., 2020 ☆39 · Jul 10, 2022 · Updated 3 years ago
- "Why do I feel offended?" - Korean Dataset for Offensive Language Identification (EACL2023 Findings) ☆15 · May 14, 2023 · Updated 2 years ago
- Official datasets and pytorch implementation repository of SQuARe and KoSBi (ACL 2023) ☆249 · Jun 29, 2023 · Updated 2 years ago