dvlab-research / Step-DPO
Implementation for "Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs"
☆359 Updated 2 months ago
Alternatives and similar repositories for Step-DPO:
Users who are interested in Step-DPO are comparing it to the repositories listed below:
- This is the repository that contains the source code for the Self-Evaluation Guided MCTS for online DPO. ☆301 Updated 8 months ago
- ☆326 Updated 2 months ago
- ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search (NeurIPS 2024) ☆613 Updated 2 months ago
- A series of technical reports on Slow Thinking with LLMs ☆630 Updated this week
- SOTA RL fine-tuning solution for advanced math reasoning of LLMs ☆103 Updated last week
- Code and Data for Scaling Relationship on Learning Mathematical Reasoning with Large Language Models ☆256 Updated 7 months ago
- Related works and background techniques for OpenAI o1 ☆218 Updated 3 months ago
- An Easy-to-use, Scalable and High-performance RLHF Framework designed for Multimodal Models. ☆109 Updated last week
- ☆278 Updated last month
- ☆356 Updated last week
- ☆516 Updated 3 months ago
- A simple toolkit for benchmarking LLMs on mathematical reasoning tasks. 🧮✨ ☆199 Updated 11 months ago
- A Survey on Efficient Reasoning for LLMs ☆301 Updated 2 weeks ago
- Awesome RL Reasoning Recipes ("Triple R") ☆375 Updated this week
- [NAACL'24] Self-data filtering of LLM instruction-tuning data using a novel perplexity-based difficulty score, without using any other mo… ☆358 Updated 7 months ago
- 😎 A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond ☆160 Updated this week
- ☆265 Updated 8 months ago
- L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning ☆180 Updated 3 weeks ago
- A lightweight reproduction of DeepSeek-R1-Zero with in-depth analysis of self-reflection behavior. ☆222 Updated 2 weeks ago
- ☆184 Updated last month
- [ICML 2024] LESS: Selecting Influential Data for Targeted Instruction Tuning ☆431 Updated 5 months ago
- ☆617 Updated 2 weeks ago
- LoRAMoE: Revolutionizing Mixture of Experts for Maintaining World Knowledge in Language Model Alignment ☆318 Updated 11 months ago
- Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning ☆169 Updated 3 weeks ago
- Deita: Data-Efficient Instruction Tuning for Alignment [ICLR2024] ☆546 Updated 4 months ago
- Paper list for Efficient Reasoning. ☆372 Updated this week
- A journey to a real multimodal R1! We are working on large-scale experiments ☆289 Updated last month
- An RLHF Infrastructure for Vision-Language Models ☆171 Updated 5 months ago
- [NeurIPS 2024 Oral] Aligner: Efficient Alignment by Learning to Correct ☆169 Updated 3 months ago
- Repository for Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning ☆162 Updated last year