Vance0124 / Token-level-Direct-Preference-OptimizationLinks

Reference implementation for Token-level Direct Preference Optimization(TDPO)

☆148

Alternatives and similar repositories for Token-level-Direct-Preference-Optimization

Users that are interested in Token-level-Direct-Preference-Optimization are comparing it to the libraries listed below

Sorting:

liziniu / ReMax
Code for Paper (ReMax: A Simple, Efficient and Effective Reinforcement Learning Method for Aligning Large Language Models)
☆198Updated last year
thu-ml / Noise-Contrastive-Alignment
Code accompanying the paper "Noise Contrastive Alignment of Language Models with Explicit Rewards" (NeurIPS 2024)
☆57Updated last year
ZHZisZZ / modpo
[ACL'24] Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization
☆93Updated last year
PRIME-RL / ImplicitPRM
Repo of paper "Free Process Rewards without Process Labels"
☆167Updated 8 months ago
CJReinforce / PURE
Official code for the paper, "Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning"
☆142Updated last month
GAIR-NLP / LIMR
☆213Updated 9 months ago
Linear95 / APO
Code for ACL2024 paper - Adversarial Preference Optimization (APO).
☆56Updated last year
PKU-Alignment / aligner
[NeurIPS 2024 Oral] Aligner: Efficient Alignment by Learning to Correct
☆190Updated 10 months ago
PRIME-RL / Entropy-Mechanism-of-RL
The Entropy Mechanism of Reinforcement Learning for Large Language Model Reasoning.
☆390Updated 4 months ago
TianHongZXY / RLVR-Decomposed
[NeurIPS 2025] Implementation for the paper "The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning"
☆127Updated last month
sanowl / Self-Correcting-LLM--Reinforcement-Learning-
This my attempt to create Self-Correcting-LLM based on the paper Training Language Models to Self-Correct via Reinforcement Learning by g…
☆37Updated 4 months ago
WooooDyy / MathCritique
Implementation for the research paper "Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision".
☆56Updated last year
kanishkg / cognitive-behaviors
☆216Updated 8 months ago
Dereck0602 / Awesome_Test_Time_LLMs
☆131Updated 8 months ago
YuxiXie / MCTS-DPO
This is the repository that contains the source code for the Self-Evaluation Guided MCTS for online DPO.
☆327Updated last year
Edward-Sun / easy-to-hard
Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision
☆124Updated last year
IAAR-Shanghai / xVerify
xVerify: Efficient Answer Verifier for Reasoning Model Evaluations
☆140Updated 3 weeks ago
ZHZisZZ / weak-to-strong-search
[NeurIPS'24] Weak-to-Strong Search: Align Large Language Models via Searching over Small Language Models
☆63Updated 11 months ago
FreedomIntelligence / OVM
☆68Updated last year
OFA-Sys / gsm8k-ScRel
Codes and Data for Scaling Relationship on Learning Mathematical Reasoning with Large Language Models
☆267Updated last year
holarissun / RewardModelingBeyondBradleyTerry
official implementation of ICLR'2025 paper: Rethinking Bradley-Terry Models in Preference-based Reward Modeling: Foundations, Theory, and…
☆69Updated 8 months ago
hkust-nlp / dart-math
[NeurIPS'24] Official code for *🎯DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving*
☆119Updated 11 months ago
CMU-AIRe / MRT
Research Code for preprint "Optimizing Test-Time Compute via Meta Reinforcement Finetuning".
☆116Updated 3 months ago
YifeiZhou02 / ArCHer
Research Code for "ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL"
☆198Updated 7 months ago
sail-sg / CPO
[NeurIPS 2024] The official implementation of paper: Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs.
☆132Updated 8 months ago
WooooDyy / LLM-Reverse-Curriculum-RL
Implementation of the ICML 2024 paper "Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning" pr…
☆113Updated last year
genrm-star / genrm-critiques
GenRM-CoT: Data release for verification rationales
☆67Updated last year
RLHFlow / RAFT
This is an official implementation of the Reward rAnked Fine-Tuning Algorithm (RAFT), also known as iterative best-of-n fine-tuning or re…
☆37Updated last year
hahahawu / Long-to-Short-via-Model-Merging
Model merging is a highly efficient approach for long-to-short reasoning.
☆92Updated last month
tongyx361 / Awesome-LLM4Math
Curation of resources for LLM mathematical reasoning, most of which are screened by @tongyx361 to ensure high quality and accompanied wit…
☆145Updated last year