This repo contains the code for generating the ToxiGen dataset, published at ACL 2022.
☆346 · Jun 17, 2024 · Updated last year
Alternatives and similar repositories for TOXIGEN
Users who are interested in TOXIGEN are comparing it to the repositories listed below.
- Repository for the Dynamically Generated Hate Speech Dataset by Vidgen et al. (2021). ☆46 · May 26, 2025 · Updated 9 months ago
- ☆44 · Jun 29, 2023 · Updated 2 years ago
- Dataset associated with "BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation" paper ☆86 · Mar 2, 2021 · Updated 4 years ago
- We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20… ☆339 · Feb 23, 2024 · Updated 2 years ago
- Röttger et al. (ACL 2021): "HateCheck: Functional Tests for Hate Speech Detection Models" - Data ☆60 · Oct 14, 2025 · Updated 4 months ago
- ☆229 · Feb 23, 2021 · Updated 5 years ago
- Trained models & code to predict toxic comments on all 3 Jigsaw Toxic Comment Challenges. Built using ⚡ Pytorch Lightning and 🤗 Transfor… ☆1,187 · Jan 5, 2026 · Updated last month
- Human preference data for "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback" ☆1,816 · Jun 17, 2025 · Updated 8 months ago
- Code and data for the EMNLP 2021 paper "Just Say No: Analyzing the Stance of Neural Dialogue Generation in Offensive Contexts". Coming so… ☆17 · Jul 27, 2023 · Updated 2 years ago
- Code for our EACL 2021 paper: "Challenges in Automated Debiasing for Toxic Language Detection" by Xuhui Zhou, Maarten Sap, Swabha Swayamd… ☆19 · Aug 20, 2021 · Updated 4 years ago
- ☆12 · Oct 23, 2022 · Updated 3 years ago
- ☆28 · Feb 27, 2025 · Updated last year
- NSMC, KorSTS ... fine-tunings ☆18 · Feb 23, 2022 · Updated 4 years ago
- Generalizable Implicit Hate Speech Detection using Contrastive Learning (COLING 2022) ☆14 · Oct 9, 2022 · Updated 3 years ago
- "Why do I feel offended?" - Korean Dataset for Offensive Language Identification (EACL2023 Findings) ☆15 · May 14, 2023 · Updated 2 years ago
- Data and code for the paper "The Moral Integrity Corpus: A Benchmark for Ethical Dialogue Systems" ☆21 · Jul 18, 2023 · Updated 2 years ago
- This repository contains the data and code introduced in the paper "CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Maske… ☆128 · Mar 1, 2024 · Updated last year
- Find and fix bugs in natural language machine learning models using adaptive testing. ☆188 · May 7, 2024 · Updated last year
- Fortifying Toxic Speech Detectors Against Veiled Toxicity ☆11 · Oct 21, 2020 · Updated 5 years ago
- Official repository of "HARE: Explainable Hate Speech Detection with Step-by-Step Reasoning", Findings of EMNLP 2023 ☆28 · Jan 25, 2024 · Updated 2 years ago
- Can we use explanations to improve hate speech models? Our paper accepted at AAAI 2021 tries to explore that question. ☆233 · Jun 12, 2023 · Updated 2 years ago
- Hate speech detection corpus in Korean, shared with EMNLP 2023 paper ☆17 · Apr 19, 2024 · Updated last year
- Using GPT-3 to detect hate speech that contains sexist and racist content ☆24 · Nov 11, 2025 · Updated 3 months ago
- IPython notebook with synthetic experiments for AFLite, based on the ICML 2020 paper, "Adversarial Filters of Dataset Biases". ☆16 · Aug 14, 2020 · Updated 5 years ago
- TruthfulQA: Measuring How Models Imitate Human Falsehoods ☆885 · Jan 16, 2025 · Updated last year
- ☆14 · Jan 6, 2025 · Updated last year
- python project template for personal projects! 🙋♀️ ☆11 · Nov 28, 2020 · Updated 5 years ago
- Chinese safety prompts for evaluating and improving the safety of LLMs. ☆1,129 · Feb 27, 2024 · Updated 2 years ago
- Data set for LREC 2020 paper "I Feel Offended, Don't Be Abusive!" ☆18 · Sep 23, 2023 · Updated 2 years ago
- Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" ☆127 · Feb 24, 2025 · Updated last year
- Code for Blodgett et al. 2016, Demographic dialectal variation in social media ☆25 · Nov 9, 2019 · Updated 6 years ago
- A Comprehensive Assessment of Trustworthiness in GPT Models ☆313 · Sep 16, 2024 · Updated last year
- ☆10 · Aug 31, 2022 · Updated 3 years ago
- Automated Pyramid Summarization Evaluation ☆12 · Jun 2, 2024 · Updated last year
- Data and code for APPDIA: A Discourse-aware Transformer-based Style Transfer Model for Offensive Social Media Conversations (COLING 2022)… ☆13 · Sep 8, 2022 · Updated 3 years ago
- Code and datasets for the paper "Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment" ☆108 · Mar 8, 2024 · Updated last year
- Research on evaluating and aligning the values of Chinese large language models ☆553 · Jul 20, 2023 · Updated 2 years ago
- A re-implementation of the "Extracting Training Data from Large Language Models" paper by Carlini et al., 2020 ☆39 · Jul 10, 2022 · Updated 3 years ago
- Repository for the Paper (AAAI 2024, Oral) --- Visual Adversarial Examples Jailbreak Large Language Models ☆266 · May 13, 2024 · Updated last year