mansicer / self-verification
☆17 Updated 3 weeks ago
Alternatives and similar repositories for self-verification
Users who are interested in self-verification are comparing it to the libraries listed below.
- ☆32 Updated last year
- Rewarded soups official implementation ☆62 Updated 2 years ago
- ☆65 Updated 10 months ago
- Code for NeurIPS 2024 paper "Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs" ☆46 Updated 11 months ago
- Code for "Reasoning to Learn from Latent Thoughts" ☆124 Updated 9 months ago
- A repo for open research on building large reasoning models ☆127 Updated this week
- [arxiv: 2512.19673] Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies ☆55 Updated 2 weeks ago
- Implementation of ICLR 2025 paper "Q-Adapter: Customizing Pre-trained LLMs to New Preferences with Forgetting Mitigation" ☆18 Updated last year
- Code for Paper (ReMax: A Simple, Efficient and Effective Reinforcement Learning Method for Aligning Large Language Models) ☆199 Updated 2 years ago
- GenRM-CoT: Data release for verification rationales ☆68 Updated last year
- Official implementation of ICLR 2025 paper: Rethinking Bradley-Terry Models in Preference-based Reward Modeling: Foundations, Theory, and… ☆70 Updated 9 months ago
- ☆55 Updated last year
- ☆117 Updated last year
- ☆118 Updated 9 months ago
- This is code for most of the experiments in the paper Understanding the Effects of RLHF on LLM Generalisation and Diversity ☆47 Updated 2 years ago
- Natural Language Reinforcement Learning ☆101 Updated 5 months ago
- Code for the paper "VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment" ☆184 Updated 7 months ago
- ☁️ KUMO: Generative Evaluation of Complex Reasoning in Large Language Models ☆19 Updated 7 months ago
- Code accompanying the paper "Noise Contrastive Alignment of Language Models with Explicit Rewards" (NeurIPS 2024) ☆58 Updated last year
- The official code release for Q#: Provably Optimal Distributional RL for LLM Post-Training ☆17 Updated 10 months ago
- ☆109 Updated last year
- SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning ☆173 Updated 4 months ago
- Research Code for "ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL" ☆201 Updated 9 months ago
- [ACL'24] Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization ☆94 Updated last year
- Reinforcing General Reasoning without Verifiers ☆93 Updated 6 months ago
- B-STAR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners ☆85 Updated 8 months ago
- Research Code for preprint "Optimizing Test-Time Compute via Meta Reinforcement Finetuning". ☆116 Updated 5 months ago
- ☆348 Updated 5 months ago
- A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models ☆71 Updated 10 months ago
- Repo of paper "Free Process Rewards without Process Labels" ☆168 Updated 10 months ago