TianduoWang / DPO-ST
[ACL 2024] Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning
☆30 · Updated 3 months ago
Related projects
Alternatives and complementary repositories for DPO-ST
- [EMNLP Findings 2024 & ACL 2024 NLRSE Oral] Enhancing Mathematical Reasoning in Language Models with Fine-grained Rewards