YuxiXie / MCTS-DPO
This repository contains the source code for Self-Evaluation Guided MCTS for online DPO.
☆306 · Updated 9 months ago
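For context, the online DPO training this repository implements ultimately optimizes the standard DPO preference objective. Below is a minimal per-pair sketch in plain Python; the function name, argument names, and default `beta` are illustrative assumptions, not the repository's actual API (which operates on batched model log-probabilities):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single preference pair:
    -log sigmoid(beta * (implicit reward of chosen - rejected)),
    where each implicit reward is the policy/reference log-prob gap."""
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)); small when chosen is clearly preferred
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When policy and reference agree (zero margin), the loss is log 2; as the policy assigns relatively more probability to the chosen response, the loss decreases toward zero.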
Alternatives and similar repositories for MCTS-DPO:
Users interested in MCTS-DPO are comparing it to the repositories listed below:
- ☆327 · Updated 3 months ago
- ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search (NeurIPS 2024) ☆619 · Updated 3 months ago
- Repo of the paper "Free Process Rewards without Process Labels" ☆145 · Updated last month
- Reference implementation for Token-level Direct Preference Optimization (TDPO) ☆138 · Updated 2 months ago
- Code for the paper "ReMax: A Simple, Efficient and Effective Reinforcement Learning Method for Aligning Large Language Models" ☆181 · Updated last year
- ☆150 · Updated 4 months ago
- ☆122 · Updated 10 months ago
- (ICML 2024) AlphaZero-like tree search can guide large language model decoding and training ☆266 · Updated 11 months ago
- Curation of resources for LLM mathematical reasoning, most of which are screened by @tongyx361 to ensure high quality and accompanied wit… ☆123 · Updated 9 months ago
- ☆275 · Updated 4 months ago
- Research code for "ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL" ☆167 · Updated 2 weeks ago
- Trial and Error: Exploration-Based Trajectory Optimization of LLM Agents (ACL 2024 Main Conference) ☆138 · Updated 6 months ago
- Official code for the paper "Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning" ☆112 · Updated 2 weeks ago
- ☆144 · Updated last month
- Code and data for "Scaling Relationship on Learning Mathematical Reasoning with Large Language Models" ☆260 · Updated 7 months ago
- A simple toolkit for benchmarking LLMs on mathematical reasoning tasks. 🧮✨ ☆208 · Updated last year
- A new tool-learning benchmark aiming at well-balanced stability and reality, based on ToolBench. ☆146 · Updated 3 weeks ago
- ☆163 · Updated last month
- ☆287 · Updated last month
- [NeurIPS 2024] Official implementation of the paper "Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs" ☆118 · Updated last month
- Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision ☆120 · Updated 7 months ago
- A lightweight reproduction of DeepSeek-R1-Zero with in-depth analysis of self-reflection behavior ☆234 · Updated 3 weeks ago
- xVerify: Efficient Answer Verifier for Reasoning Model Evaluations ☆90 · Updated 2 weeks ago
- ☆192 · Updated 2 months ago
- ☆138 · Updated this week
- L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning ☆195 · Updated last month
- Related works and background techniques for OpenAI o1 ☆221 · Updated 3 months ago
- RewardBench: the first evaluation tool for reward models ☆562 · Updated 2 months ago
- Code for "STaR: Bootstrapping Reasoning With Reasoning" (NeurIPS 2022) ☆205 · Updated 2 years ago
- Implementation for the research paper "Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision" ☆52 · Updated 5 months ago