junkangwu / beta-DPO

[NeurIPS 2024] Official code of $\beta$-DPO: Direct Preference Optimization with Dynamic $\beta$

☆49

Alternatives and similar repositories for beta-DPO

Users who are interested in beta-DPO are comparing it to the repositories listed below

bethgelab / sober-reasoning
A Sober Look at Language Model Reasoning
☆89Updated 2 weeks ago
ZHZisZZ / modpo
[ACL'24] Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization
☆93Updated last year
junkangwu / alpha-DPO
[ICML 2025] Official code of "AlphaDPO: Adaptive Reward Margin for Direct Preference Optimization"
☆22Updated last year
bobxwu / learning-from-rewards-llm-papers
A comprehensive collection of learning from rewards in the post-training and test-time scaling of LLMs, with a focus on both reward model…
☆58Updated 5 months ago
sail-sg / CPO
[NeurIPS 2024] The official implementation of paper: Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs.
☆132Updated 8 months ago
ZHZisZZ / weak-to-strong-search
[NeurIPS'24] Weak-to-Strong Search: Align Large Language Models via Searching over Small Language Models
☆63Updated 11 months ago
BeyonderXX / TRACE
TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models
☆81Updated last year
RUCAIBox / RLMEC
The official repository of "Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint"
☆38Updated last year
TianHongZXY / RLVR-Decomposed
[NeurIPS 2025] Implementation for the paper "The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning"
☆127Updated last month
Kwai-Klear / RLEP
RL with Experience Replay
☆49Updated 4 months ago
TianduoWang / DPO-ST
[ACL 2024] Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning
☆52Updated last year
RM-R1-UIUC / RM-R1
RM-R1: Unleashing the Reasoning Potential of Reward Models
☆151Updated 5 months ago
GAIR-NLP / ReasonEval
[AAAI 2025 oral] Evaluating Mathematical Reasoning Beyond Accuracy
☆76Updated last month
RLHFlow / Directional-Preference-Alignment
Directional Preference Alignment
☆58Updated last year
rookie-joe / AutoPSV
☆51Updated last year
yyDing1 / ScaleQuest
[ACL 2025] We introduce ScaleQuest, a scalable, novel and cost-effective data synthesis method to unleash the reasoning capability of LLM…
☆68Updated last year
OpenBMB / CPO
☆23Updated last year
Dereck0602 / Awesome_Test_Time_LLMs
☆134Updated 8 months ago
weizhepei / WebAgent-R1
[EMNLP 2025] WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning
☆60Updated last month
THU-KEG / RM-Bench
[ICLR 25 Oral] RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style
☆70Updated 4 months ago
holarissun / RewardModelingBeyondBradleyTerry
official implementation of ICLR'2025 paper: Rethinking Bradley-Terry Models in Preference-based Reward Modeling: Foundations, Theory, and…
☆69Updated 8 months ago
SparkJiao / dpo-trajectory-reasoning
[EMNLP 2024] Source code for the paper "Learning Planning-based Reasoning with Trajectory Collection and Process Rewards Synthesizing".
☆82Updated 10 months ago
sanowl / Self-Correcting-LLM--Reinforcement-Learning-
This is my attempt to create Self-Correcting-LLM based on the paper Training Language Models to Self-Correct via Reinforcement Learning by g…
☆37Updated 4 months ago
PKU-Alignment / aligner
[NeurIPS 2024 Oral] Aligner: Efficient Alignment by Learning to Correct
☆190Updated 10 months ago
Zhou-Zoey / RMB-Reward-Model-Benchmark
☆46Updated 8 months ago
wizard-III / ArcherCodeR
ArcherCodeR is an open-source initiative enhancing code reasoning in large language models through scalable, rule-governed reinforcement …
☆43Updated 4 months ago
Vance0124 / Token-level-Direct-Preference-Optimization
Reference implementation for Token-level Direct Preference Optimization (TDPO)
☆148Updated 9 months ago
hahahawu / Long-to-Short-via-Model-Merging
Model merging is a highly efficient approach for long-to-short reasoning.
☆92Updated last month
WeiminXiong / IPR
Watch Every Step! LLM Agent Learning via Iterative Step-level Process Refinement (EMNLP 2024 Main Conference)
☆63Updated last year
genrm-star / genrm-critiques
GenRM-CoT: Data release for verification rationales
☆66Updated last year