Code and example data for the paper: Rule Based Rewards for Language Model Safety
☆208 · Jul 19, 2024 · Updated last year
Alternatives and similar repositories for safety-rbr-code-and-data
Users interested in safety-rbr-code-and-data are comparing it to the repositories listed below.
- ☆16 · Jul 23, 2024 · Updated last year
- ☆20 · Nov 3, 2024 · Updated last year
- ☆160 · Nov 23, 2024 · Updated last year
- A novel approach to improving the safety of large language models, enabling them to transition effectively from an unsafe to a safe state. ☆72 · May 22, 2025 · Updated 9 months ago
- ☆58 · Oct 4, 2025 · Updated 5 months ago
- RewardBench: the first evaluation tool for reward models. ☆697 · Feb 16, 2026 · Updated 2 weeks ago
- Recipes to train reward models for RLHF. ☆1,517 · Apr 24, 2025 · Updated 10 months ago
- Independent robustness evaluation of "Improving Alignment and Robustness with Short Circuiting" ☆18 · Apr 15, 2025 · Updated 10 months ago
- Improving Alignment and Robustness with Circuit Breakers ☆258 · Sep 24, 2024 · Updated last year
- Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training". ☆131 · Mar 9, 2024 · Updated last year
- [AAAI'26 Oral] Official Implementation of STAR-1: Safer Alignment of Reasoning LLMs with 1K Data ☆33 · Apr 7, 2025 · Updated 10 months ago
- Human preference data for "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback" ☆1,818 · Jun 17, 2025 · Updated 8 months ago
- ☆11 · Mar 13, 2023 · Updated 2 years ago
- Code for the paper "Executing Arithmetic: Fine-Tuning Large Language Models as Turing Machines" ☆11 · Oct 11, 2024 · Updated last year
- Azure Command-Line Interface ☆11 · Dec 10, 2023 · Updated 2 years ago
- [ICML 2025] Official code of "AlphaDPO: Adaptive Reward Margin for Direct Preference Optimization" ☆30 · Jan 10, 2026 · Updated last month
- Scalable toolkit for efficient model alignment ☆849 · Oct 6, 2025 · Updated 4 months ago
- Learning from preferences is a common paradigm for fine-tuning language models. Yet, many algorithmic design decisions come into play. Ou… ☆32 · Apr 20, 2024 · Updated last year
- Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives ☆70 · Feb 22, 2024 · Updated 2 years ago
- Adversarial Attacks on GPT-4 via Simple Random Search [Dec 2023] ☆43 · Apr 28, 2024 · Updated last year
- [NeurIPS 2024 D&B] Evaluating Copyright Takedown Methods for Language Models ☆17 · Jul 17, 2024 · Updated last year
- [NeurIPS'24] Official code for *🎯DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving* ☆120 · Dec 10, 2024 · Updated last year
- Code for the API, workload execution, and agents underlying the LLMail-Inject Adaptive Prompt Injection Challenge ☆19 · Updated this week
- [ICLR 2025] On Evaluating the Durability of Safeguards for Open-Weight LLMs ☆13 · Jun 20, 2025 · Updated 8 months ago
- CodeUltraFeedback: aligning large language models to coding preferences (TOSEM 2025) ☆73 · Jun 25, 2024 · Updated last year
- A recipe for online RLHF and online iterative DPO. ☆540 · Dec 28, 2024 · Updated last year
- The official repository of "Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint" ☆39 · Jan 12, 2024 · Updated 2 years ago
- HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal ☆864 · Aug 16, 2024 · Updated last year
- [NeurIPS 2024] OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI ☆107 · Mar 6, 2025 · Updated 11 months ago
- 800,000 step-level correctness labels on LLM solutions to MATH problems ☆2,094 · Jun 1, 2023 · Updated 2 years ago
- ☆25 · Sep 5, 2024 · Updated last year
- ICLR 2024 paper showing properties of safety tuning and exaggerated safety. ☆93 · May 9, 2024 · Updated last year
- Fluentd output plugin that sends events to Amazon Kinesis Streams and Amazon Kinesis Firehose. ☆12 · Apr 2, 2023 · Updated 2 years ago
- ☆32 · Nov 18, 2025 · Updated 3 months ago
- ☆571 · Jul 19, 2024 · Updated last year
- ☆98 · Jun 27, 2024 · Updated last year
- [NeurIPS 2024 Oral] Aligner: Efficient Alignment by Learning to Correct ☆191 · Jan 16, 2025 · Updated last year
- Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback ☆1,585 · Nov 24, 2025 · Updated 3 months ago
- ☆1,072 · Mar 6, 2024 · Updated last year