mathllm / Step-Controlled_DPO
☆22 Updated last year
Alternatives and similar repositories for Step-Controlled_DPO
Users who are interested in Step-Controlled_DPO are comparing it to the repositories listed below
- RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment ☆16 Updated 9 months ago
- ☆45 Updated this week
- The official repository of "Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint" ☆38 Updated last year
- ☆37 Updated last month
- ☆30 Updated 9 months ago
- Codebase for Instruction Following without Instruction Tuning ☆35 Updated last year
- ☆18 Updated 2 months ago
- [ICLR 2025] LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization ☆40 Updated 7 months ago
- [ICLR'24 spotlight] Tool-Augmented Reward Modeling ☆51 Updated 3 months ago
- The code and data for the paper JiuZhang3.0 ☆49 Updated last year
- [arxiv: 2505.02156] Adaptive Thinking via Mode Policy Optimization for Social Language Agents ☆44 Updated 3 months ago
- ☆50 Updated 11 months ago
- [ICLR 2025] SuperCorrect: Advancing Small LLM Reasoning with Thought Template Distillation and Self-Correction ☆82 Updated 6 months ago
- Official code implementation for the ACL 2025 paper: 'Dynamic Scaling of Unit Tests for Code Reward Modeling' ☆25 Updated 4 months ago
- ☆26 Updated 5 months ago
- ☆48 Updated 7 months ago
- [ACL 2025] Are Your LLMs Capable of Stable Reasoning? ☆30 Updated last month
- [NeurIPS'24] Weak-to-Strong Search: Align Large Language Models via Searching over Small Language Models ☆62 Updated 9 months ago
- JudgeLRM: Large Reasoning Models as a Judge ☆39 Updated 2 weeks ago
- [ICML 2025] Teaching Language Models to Critique via Reinforcement Learning ☆111 Updated 4 months ago
- ☆16 Updated last year
- TreeRL: LLM Reinforcement Learning with On-Policy Tree Search in ACL'25 ☆68 Updated 3 months ago
- [ICLR 25 Oral] RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style ☆62 Updated 2 months ago
- [ACL-25] We introduce ScaleQuest, a scalable, novel and cost-effective data synthesis method to unleash the reasoning capability of LLMs. ☆68 Updated 11 months ago
- ☆21 Updated 5 months ago
- ☆59 Updated last year
- [AAAI 2025 oral] Evaluating Mathematical Reasoning Beyond Accuracy ☆69 Updated 9 months ago
- Source code for our paper: "ARIA: Training Language Agents with Intention-Driven Reward Aggregation". ☆22 Updated last month
- The official repository of NeurIPS'25 paper "Ada-R1: From Long-Cot to Hybrid-CoT via Bi-Level Adaptive Reasoning Optimization" ☆18 Updated 2 weeks ago
- The official repository of the Omni-MATH benchmark. ☆88 Updated 9 months ago