TU2021 / DPO-VP
Improving Math reasoning through Direct Preference Optimization with Verifiable Pairs
☆9 · Updated last month
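Based on the name and description, DPO-VP appears to build preference pairs whose chosen/rejected labels come from a verifiable correctness check on the final math answer rather than from human annotation. As a rough illustration only (not this repository's code), a minimal sketch of that setup, assuming the standard DPO loss and a simple answer-matching verifier, could look like the following; all names (`dpo_loss`, `build_verifiable_pair`, `beta`) are illustrative assumptions:

```python
# Minimal sketch: DPO on "verifiable pairs" (illustrative, not the repo's code).
# Assumption: for each math prompt, sampled responses are scored by an automatic
# answer checker; a verified-correct response is "chosen", an incorrect one "rejected".
import torch
import torch.nn.functional as F


def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log p_theta(y_chosen | x), shape (B,)
    policy_rejected_logps: torch.Tensor,  # log p_theta(y_rejected | x), shape (B,)
    ref_chosen_logps: torch.Tensor,       # log p_ref(y_chosen | x), shape (B,)
    ref_rejected_logps: torch.Tensor,     # log p_ref(y_rejected | x), shape (B,)
    beta: float = 0.1,
) -> torch.Tensor:
    """Standard DPO objective: -log sigmoid(beta * (policy log-ratio - reference log-ratio))."""
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()


def build_verifiable_pair(responses, extracted_answers, gold_answer):
    """Form one (chosen, rejected) pair for a prompt from sampled responses.

    The comparison of each extracted final answer against `gold_answer` is the
    "verifiable" signal (e.g. exact string or numeric match).
    Returns (chosen, rejected) or None if no valid pair exists.
    """
    correct = [r for r, a in zip(responses, extracted_answers) if a == gold_answer]
    incorrect = [r for r, a in zip(responses, extracted_answers) if a != gold_answer]
    if not correct or not incorrect:
        return None
    return correct[0], incorrect[0]
```

In this sketch the only departure from vanilla DPO is the source of the preference label: an automatic verifier in place of a human annotator.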
Alternatives and similar repositories for DPO-VP
Users interested in DPO-VP are comparing it to the repositories listed below.
- Code for the NeurIPS 2024 paper "Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs" · ☆34 · Updated 2 months ago
- [ACL'24, Outstanding Paper] Emulated Disalignment: Safety Alignment for Large Language Models May Backfire! · ☆36 · Updated 9 months ago
- [ACL'24] Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization · ☆76 · Updated 8 months ago
- Official implementation of Rewarded Soups · ☆57 · Updated last year
- Code for the ICML 2024 paper "Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment" · ☆69 · Updated 4 months ago
- Official implementation of the ICLR 2025 paper "Rethinking Bradley-Terry Models in Preference-based Reward Modeling: Foundations, Theory, and…" · ☆58 · Updated last month
- [NeurIPS 2023] Large Language Models Are Semi-Parametric Reinforcement Learning Agents · ☆35 · Updated last year
- An index of algorithms for reinforcement learning from human feedback (RLHF) · ☆92 · Updated last year
- Code for the paper "Policy Optimization in RLHF: The Impact of Out-of-preference Data" · ☆28 · Updated last year
- Official code for "Unleashing the Power of Pre-trained Language Models for Offline Reinforcement Learning" · ☆47 · Updated last year
- Official code for the paper "Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning" · ☆113 · Updated last week
- DistRL: An Asynchronous Distributed Reinforcement Learning Framework for On-Device Control Agents · ☆21 · Updated 2 months ago
- What Makes a Reward Model a Good Teacher? An Optimization Perspective · ☆28 · Updated last month
- Uni-RLHF platform for "Uni-RLHF: Universal Platform and Benchmark Suite for Reinforcement Learning with Diverse Human Feedback" (ICLR 2024…