TU2021 / DPO-VP
Improving math reasoning through Direct Preference Optimization with Verifiable Pairs
☆13 · Updated 3 months ago
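For context on what this repository and several of those listed below build on: Direct Preference Optimization (DPO) trains a policy directly on chosen/rejected response pairs, without fitting a separate reward model. A minimal sketch of the standard per-pair DPO loss, assuming summed token log-probabilities are already available (function and parameter names are illustrative, not taken from DPO-VP):

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    # Implicit reward margin of the trained policy relative to the frozen reference.
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Plain form for clarity; production code would use a numerically stable log-sigmoid.
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

When the policy gives the chosen response a larger margin over the reference than it gives the rejected one, the logits are positive and the loss falls below log 2; at zero margin the loss is exactly log 2.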
Alternatives and similar repositories for DPO-VP
Users interested in DPO-VP are comparing it to the libraries listed below.
- Official code for the paper "Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning"☆125 · Updated last week
- Code for the NeurIPS 2024 paper "Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs"☆37 · Updated 4 months ago
- An index of algorithms for reinforcement learning from human feedback (RLHF)☆92 · Updated last year
- [ACL'24] Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization☆80 · Updated 10 months ago
- Rewarded soups official implementation☆58 · Updated last year
- [ACL'24, Outstanding Paper] Emulated Disalignment: Safety Alignment for Large Language Models May Backfire!☆37 · Updated 10 months ago
- [NeurIPS 2023] Large Language Models Are Semi-Parametric Reinforcement Learning Agents☆34 · Updated last year
- Official implementation of the ICLR 2025 paper "Rethinking Bradley-Terry Models in Preference-based Reward Modeling: Foundations, Theory, and…"☆62 · Updated 2 months ago
- DistRL: An Asynchronous Distributed Reinforcement Learning Framework for On-Device Control Agents☆24 · Updated 3 months ago
- Code for the paper "Policy Optimization in RLHF: The Impact of Out-of-preference Data"☆28 · Updated last year
- Code for the ICML 2024 paper "Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment"☆72 · Updated 2 weeks ago
- Official codebase for "GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning"☆75 · Updated 3 weeks ago
- The Entropy Mechanism of Reinforcement Learning for Large Language Model Reasoning☆191 · Updated last week
- ☆19 · Updated 2 weeks ago
- Official implementation of the NeurIPS 2024 paper CORY☆16 · Updated 3 months ago
- Uni-RLHF platform for "Uni-RLHF: Universal Platform and Benchmark Suite for Reinforcement Learning with Diverse Human Feedback" (ICLR 2024…☆36 · Updated 7 months ago
- Reinforced Multi-LLM Agents training☆17 · Updated 2 weeks ago
- ☆24 · Updated last year
- Implementation of the ICLR 2025 paper "Q-Adapter: Customizing Pre-trained LLMs to New Preferences with Forgetting Mitigation"☆17 · Updated 8 months ago
- [ICML 2025] Official implementation of GLIDER☆46 · Updated last month
- Reference implementation for Token-level Direct Preference Optimization (TDPO)☆141 · Updated 4 months ago
- A comprehensive collection of process reward models☆92 · Updated 2 weeks ago
- What Makes a Reward Model a Good Teacher? An Optimization Perspective☆32 · Updated this week
- Code release for "Generating Code World Models with Large Language Models Guided by Monte Carlo Tree Search", published at NeurIPS '24☆11 · Updated 4 months ago
- Official codebase for CuGRO: Continual Offline Reinforcement Learning via Diffusion-based Dual Generative Replay☆30 · Updated last year
- Official code for "Decoding-Time Language Model Alignment with Multiple Objectives"☆24 · Updated 7 months ago
- Official code for "RIME: Robust Preference-based Reinforcement Learning with Noisy Preferences" (ICML 2024 Spotlight)☆30 · Updated 8 months ago
- [ICML 2025 Oral] Official repository for the paper "Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchma…"☆58 · Updated last week
- Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts☆24 · Updated last year
- Direct preference optimization with f-divergences☆13 · Updated 7 months ago