openai / moderation-api-release
☆116 · Updated 2 years ago
Related projects
Alternatives and complementary repositories for moderation-api-release
- Official repository for "Bring Your Own Data! Self-Supervised Evaluation for Large Language Models" ☆109 · Updated last year
- Scripts for generating synthetic fine-tuning data to reduce sycophancy. ☆107 · Updated last year
- Run safety benchmarks against AI models and view detailed reports showing how well they performed. ☆62 · Updated this week
- Röttger et al. (2023): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" ☆64 · Updated 10 months ago
- Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs ☆39 · Updated 3 weeks ago
- Datasets from the paper "Towards Understanding Sycophancy in Language Models" ☆63 · Updated last year
- Code and datasets for the paper "Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment" ☆79 · Updated 8 months ago
- Code for generating the ToxiGen dataset, published at ACL 2022. ☆280 · Updated 5 months ago
- Repository for the paper "Shepherd: A Critic for Language Model Generation" ☆213 · Updated last year
- Improving Alignment and Robustness with Circuit Breakers ☆154 · Updated last month
- SILO Language Models code repository ☆80 · Updated 8 months ago
- [Data + code] ExpertQA: Expert-Curated Questions and Attributed Answers ☆122 · Updated 8 months ago
- Code, datasets, and models for the paper "Automatic Evaluation of Attribution by Large Language Models" ☆53 · Updated last year
- Open Instruction Generalist is an assistant trained on massive synthetic instructions to perform many millions of tasks. ☆206 · Updated 10 months ago
- GitHub repo for "Goal Driven Discovery of Distributional Differences via Language Descriptions" ☆68 · Updated last year
- Code accompanying the paper "Pretraining Language Models with Human Preferences" ☆177 · Updated 9 months ago
- Official repository for our NeurIPS 2023 paper "Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense…" ☆138 · Updated last year
- [ICLR'24 Spotlight] A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use ☆115 · Updated 8 months ago
- The Prism Alignment Project ☆37 · Updated 6 months ago
- [ICLR 2024 Spotlight] FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets ☆211 · Updated 10 months ago
- Original implementation of "Detecting Pretraining Data from Large Language Models" by *Weijia Shi, *Anirudh Aji… ☆208 · Updated last year
- WMDP is an LLM proxy benchmark for hazardous knowledge in bio, cyber, and chemical security. We also release code for RMU, an unlearning m… ☆82 · Updated 6 months ago
- Code and data accompanying the arXiv paper "Faithful Chain-of-Thought Reasoning". ☆155 · Updated 6 months ago