This repo contains the code for generating the ToxiGen dataset, published at ACL 2022.
☆346 · Jun 17, 2024 · Updated last year
Alternatives and similar repositories for TOXIGEN
Users interested in TOXIGEN also compare it to the repositories listed below.
- Repository for the Dynamically Generated Hate Speech Dataset by Vidgen et al. (2021). ☆44 · May 26, 2025 · Updated last year
- ☆44 · Jun 29, 2023 · Updated 2 years ago
- Dataset associated with the paper "BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation" ☆88 · Mar 2, 2021 · Updated 5 years ago
- Röttger et al. (ACL 2021): "HateCheck: Functional Tests for Hate Speech Detection Models" - Data ☆59 · Oct 14, 2025 · Updated 7 months ago
- ☆232 · Feb 23, 2021 · Updated 5 years ago
- We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20… ☆351 · Feb 23, 2024 · Updated 2 years ago
- Code and data for the EMNLP 2021 paper "Just Say No: Analyzing the Stance of Neural Dialogue Generation in Offensive Contexts". Coming so… ☆17 · Jul 27, 2023 · Updated 2 years ago
- Trained models & code to predict toxic comments on all 3 Jigsaw Toxic Comment Challenges. Built using ⚡ PyTorch Lightning and 🤗 Transfor… ☆1,250 · Apr 6, 2026 · Updated 2 months ago
- Code for our EACL 2021 paper "Challenges in Automated Debiasing for Toxic Language Detection" by Xuhui Zhou, Maarten Sap, Swabha Swayamd… ☆20 · Aug 20, 2021 · Updated 4 years ago
- ☆28 · Feb 27, 2025 · Updated last year
- Official repository of "HARE: Explainable Hate Speech Detection with Step-by-Step Reasoning", Findings of EMNLP 2023 ☆28 · Jan 25, 2024 · Updated 2 years ago
- Human preference data for "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback" ☆1,837 · Jun 17, 2025 · Updated 11 months ago
- Generalizable Implicit Hate Speech Detection using Contrastive Learning (COLING 2022) ☆14 · Oct 9, 2022 · Updated 3 years ago
- ☆12 · Oct 23, 2022 · Updated 3 years ago
- Can we use explanations to improve hate speech models? Our AAAI 2021 paper explores that question. ☆241 · Jun 12, 2023 · Updated 2 years ago
- Automated Pyramid Summarization Evaluation ☆12 · Jun 2, 2024 · Updated 2 years ago
- Chinese safety prompts for evaluating and improving the safety of LLMs. ☆1,171 · Feb 27, 2024 · Updated 2 years ago
- Find and fix bugs in natural language machine learning models using adaptive testing. ☆190 · May 7, 2024 · Updated 2 years ago
- "Stones from other hills can polish jade": a series on large-model evaluation and governance released by the JADE team at Fudan University ☆512 · May 14, 2026 · Updated 3 weeks ago
- IPython notebook with synthetic experiments for AFLite, based on the ICML 2020 paper "Adversarial Filters of Dataset Biases". ☆16 · Aug 14, 2020 · Updated 5 years ago
- This repository contains the data and code introduced in the paper "CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Maske… ☆135 · Mar 1, 2024 · Updated 2 years ago
- Hate speech detection corpus in Korean, shared with an EMNLP 2023 paper ☆17 · Apr 19, 2024 · Updated 2 years ago
- Official repository for the paper "Gradient-based Jailbreak Images for Multimodal Fusion Models" (https://arxiv.org/abs/2410.03489) ☆20 · Oct 22, 2024 · Updated last year
- NSMC, KorSTS ... fine-tunings ☆18 · Feb 23, 2022 · Updated 4 years ago
- Repository for the Bias Benchmark for QA dataset. ☆142 · Jan 8, 2024 · Updated 2 years ago
- Python project template for personal projects! 🙋♀️ ☆11 · Nov 28, 2020 · Updated 5 years ago
- TruthfulQA: Measuring How Models Imitate Human Falsehoods ☆925 · Jan 16, 2025 · Updated last year
- ☆27 · Nov 20, 2023 · Updated 2 years ago
- Fortifying Toxic Speech Detectors Against Veiled Toxicity ☆11 · Oct 21, 2020 · Updated 5 years ago
- A Comprehensive Assessment of Trustworthiness in GPT Models ☆316 · Sep 16, 2024 · Updated last year
- Using GPT-3 to detect hate speech that contains sexist and racist content ☆24 · Nov 11, 2025 · Updated 6 months ago
- Code and datasets for the paper "Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment" ☆111 · Mar 8, 2024 · Updated 2 years ago
- Official repo for GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts ☆586 · Feb 27, 2026 · Updated 3 months ago
- An original implementation of the paper "CREPE: Open-Domain Question Answering with False Presuppositions" ☆16 · Nov 5, 2024 · Updated last year
- Research on evaluating and aligning the values of Chinese large language models ☆556 · Jul 20, 2023 · Updated 2 years ago
- A re-implementation of the paper "Extracting Training Data from Large Language Models" by Carlini et al., 2020 ☆39 · Jul 10, 2022 · Updated 3 years ago
- ☆161 · Aug 9, 2022 · Updated 3 years ago
- "Why do I feel offended?" - Korean Dataset for Offensive Language Identification (Findings of EACL 2023) ☆15 · May 14, 2023 · Updated 3 years ago
- Corpus Annotation Graph builder (CAG) is an architectural framework that employs the build-and-annotate pattern for creating a graph. ☆14 · Dec 7, 2023 · Updated 2 years ago