☆161Aug 9, 2022Updated 3 years ago
Alternatives and similar repositories for moderation-api-release
Users that are interested in moderation-api-release are comparing it to the libraries listed below.
- Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs☆122Dec 2, 2024Updated last year
- Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models"☆134Feb 24, 2025Updated last year
- Chef cookbooks for managing a Ceph cluster☆12Apr 2, 2023Updated 3 years ago
- Fluentd output plugin that sends events to Amazon Kinesis Streams and Amazon Kinesis Firehose.☆13Apr 2, 2023Updated 3 years ago
- Run safety benchmarks against AI models and view detailed reports showing how well they performed.☆124May 12, 2026Updated last week
- ☆27Nov 20, 2023Updated 2 years ago
- BeaverTails is a collection of datasets designed to facilitate research on safety alignment in large language models (LLMs).☆181Oct 27, 2023Updated 2 years ago
- ☆10Oct 31, 2022Updated 3 years ago
- Tensors and Dynamic neural networks in Python with strong GPU acceleration☆51Oct 4, 2021Updated 4 years ago
- Code implementation of R^2-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning☆22Jul 8, 2024Updated last year
- ☆36Updated this week
- Human preference data for "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback"☆1,841Jun 17, 2025Updated 11 months ago
- ☆132Nov 13, 2023Updated 2 years ago
- [NeurIPS 2023] Differentially Private Image Classification by Learning Priors from Random Processes☆12Jun 12, 2023Updated 2 years ago
- ☆45Oct 1, 2024Updated last year
- HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal☆953Aug 16, 2024Updated last year
- We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20…☆350Feb 23, 2024Updated 2 years ago
- ☆45Jun 19, 2025Updated 11 months ago
- A fast + lightweight implementation of the GCG algorithm in PyTorch☆331May 13, 2025Updated last year
- This repo contains the code for generating the ToxiGen dataset, published at ACL 2022.☆346Jun 17, 2024Updated last year
- This is the official GitHub repo for our paper: "BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Lang…☆22Jul 3, 2024Updated last year
- This repository contains the training and evaluation code for llm-jp-modernbert-base.☆17Jun 17, 2025Updated 11 months ago
- Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment☆62Aug 30, 2024Updated last year
- Causal Analysis of Agent Behavior for AI Safety☆20Jun 27, 2023Updated 2 years ago
- Lightblue LLM Eval Framework: tengu, elyza100, ja-mtbench, rakuda☆18Apr 29, 2026Updated 3 weeks ago
- Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning [ICML 2024]☆21May 2, 2024Updated 2 years ago
- Code for the paper "Batch size invariance for policy optimization"☆60Apr 2, 2023Updated 3 years ago
- Persuasive Jailbreaker: we can persuade LLMs to jailbreak them!☆356Oct 17, 2025Updated 7 months ago
- [SIGIR '22] Code for our SIGIR 2022 accepted paper : P3 Ranker: Mitigating the Gaps between Pre-training and Ranking Fine-tuning with Pr…☆18Sep 24, 2023Updated 2 years ago
- Service for quickly aliasing and redirecting to long URLs☆25Apr 26, 2023Updated 3 years ago
- R-Judge: Benchmarking Safety Risk Awareness for LLM Agents (EMNLP Findings 2024)☆103Jan 11, 2026Updated 4 months ago
- An Empirical Study of Memorization in NLP (ACL 2022)☆13Jun 22, 2022Updated 3 years ago
- ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors [EMNLP 2024 Findings]☆231Sep 29, 2024Updated last year
- ☆27Jun 5, 2024Updated last year
- Add-on package to gym, to record sequences of actions, observations, and rewards☆76Apr 2, 2023Updated 3 years ago
- ☆19Sep 29, 2024Updated last year
- A collection of infrastructure and tools for research in neural network interpretability.☆37Jan 25, 2019Updated 7 years ago
- A Test Collection of Computer Science Papers for Faceted Query by Example☆23Nov 28, 2021Updated 4 years ago
- Repository for out-of-tree scheduler plugins based on scheduler framework.☆14Apr 2, 2023Updated 3 years ago