xiwenc1 / DRA-GRPO
Official code for the paper: DRA-GRPO: Exploring Diversity-Aware Reward Adjustment for R1-Zero-Like Training of Large Language Models
☆21 Updated 3 weeks ago
Alternatives and similar repositories for DRA-GRPO
Users who are interested in DRA-GRPO are comparing it to the repositories listed below.
- [NeurIPS 2024] "Can Language Models Perform Robust Reasoning in Chain-of-thought Prompting with Noisy Rationales?" ☆38 Updated 6 months ago
- A Sober Look at Language Model Reasoning ☆92 Updated 2 months ago
- ☆305 Updated 6 months ago
- Discriminative Constrained Optimization for Reinforcing Large Reasoning Models ☆50 Updated 2 months ago
- [NeurIPS 2024] Official code of $\beta$-DPO: Direct Preference Optimization with Dynamic $\beta$ ☆50 Updated last year
- [ACL 2025 Main] (🏆 Outstanding Paper Award) Rethinking the Role of Prompting Strategies in LLM Test-Time Scaling: A Perspective of Proba… ☆15 Updated 5 months ago
- ☆204 Updated last month
- [NeurIPS 2025] What Makes a Reward Model a Good Teacher? An Optimization Perspective ☆42 Updated 4 months ago
- Resources and paper list for 'Scaling Environments for Agents'. This repository accompanies our survey on how environments contribute to … ☆57 Updated this week
- ☆33 Updated 2 months ago
- Analyzing and Reducing Catastrophic Forgetting in Parameter Efficient Tuning ☆36 Updated last year
- A curated list of resources on Reinforcement Learning with Verifiable Rewards (RLVR) and the reasoning capability boundary of Large Langu… ☆85 Updated last month
- ☆63 Updated 6 months ago
- [NeurIPS25 Spotlight] EMPO, A Fully Unsupervised RLVR Method ☆93 Updated 2 months ago
- [NeurIPS 2024] The official implementation of paper: Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs. ☆134 Updated 10 months ago
- CoT-Valve: Length-Compressible Chain-of-Thought Tuning ☆89 Updated 11 months ago
- ☆43 Updated 5 months ago
- [ACL'24] Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization ☆95 Updated last year
- ☆45 Updated last month
- [ICML'25] Our study systematically investigates massive values in LLMs' attention mechanisms. First, we observe massive values are concen… ☆85 Updated 7 months ago
- 🔥🔥🔥Latest Papers, Codes on Uncertainty-based RL ☆57 Updated 5 months ago
- One-shot Entropy Minimization ☆188 Updated 7 months ago
- Principled Data Selection for Alignment: The Hidden Risks of Difficult Examples ☆44 Updated 6 months ago
- [NeurIPS 2025] Implementation for the paper "The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning" ☆157 Updated 3 months ago
- ☆111 Updated 7 months ago
- ☆47 Updated 9 months ago
- ☆35 Updated 5 months ago
- [ICLR 2025] Code&Data for the paper "Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization" ☆13 Updated last year
- [ICLR 2025] Code and Data Repo for Paper "Latent Space Chain-of-Embedding Enables Output-free LLM Self-Evaluation" ☆93 Updated last year
- ACL'2025: SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs. and preprint: SoftCoT++: Test-Time Scaling with Soft Chain-of… ☆76 Updated 8 months ago