WindyLee0822 / Process_Q_Model

Official implementation of the paper "Process Reward Model with Q-value Rankings"

☆65

Alternatives and similar repositories for Process_Q_Model

Users who are interested in Process_Q_Model are comparing it to the repositories listed below.

THUDM / T1
RL Scaling and Test-Time Scaling (ICML'25)
☆112 Updated 10 months ago
PRIME-RL / ImplicitPRM
Repo of paper "Free Process Rewards without Process Labels"
☆167 Updated 8 months ago
hkust-nlp / B-STaR
B-STAR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners
☆86 Updated 6 months ago
jwhj / OREO
☆117 Updated 10 months ago
HKUNLP / critic-rl
[ICML 2025] Teaching Language Models to Critique via Reinforcement Learning
☆118 Updated 6 months ago
TIGER-AI-Lab / CritiqueFineTuning
Code for "Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate" [COLM 2025]
☆179 Updated 4 months ago
Yu-Fangxu / FoR
[ICML 2025] Flow of Reasoning: Training LLMs for Divergent Reasoning with Minimal Examples
☆112 Updated 4 months ago
YangLing0818 / SuperCorrect-llm
[ICLR 2025] SuperCorrect: Advancing Small LLM Reasoning with Thought Template Distillation and Self-Correction
☆83 Updated 8 months ago
SiliangZeng / Multi-Turn-RL-Agent
☆98 Updated 5 months ago
zitian-gao / SC-MCTS
Interpretable Contrastive Monte Carlo Tree Search Reasoning
☆48 Updated last year
sail-sg / CPO
[NeurIPS 2024] The official implementation of paper: Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs.
☆132 Updated 8 months ago
icip-cas / Verifier-Engineering
Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering
☆63 Updated 11 months ago
CMU-AIRe / MRT
Research Code for preprint "Optimizing Test-Time Compute via Meta Reinforcement Finetuning".
☆116 Updated 3 months ago
Edward-Sun / easy-to-hard
Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision
☆124 Updated last year
SynthLabsAI / big-math
A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models
☆68 Updated 9 months ago
zankner / CLoud
Critique-out-Loud Reward Models
☆70 Updated last year
WooooDyy / LLM-Reverse-Curriculum-RL
Implementation of the ICML 2024 paper "Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning" pr…
☆113 Updated last year
Yifan-Song793 / ETO
Trial and Error: Exploration-Based Trajectory Optimization of LLM Agents (ACL 2024 Main Conference)
☆159 Updated last year
waterhorse1 / Natural-language-RL
Natural Language Reinforcement Learning
☆100 Updated 4 months ago
rookie-joe / AutoPSV
☆51 Updated last year
test-time-interaction / TTI
☆65 Updated 5 months ago
sanjibanc / agent_prm
☆49 Updated 9 months ago
Vance0124 / Token-level-Direct-Preference-Optimization
Reference implementation for Token-level Direct Preference Optimization (TDPO)
☆148 Updated 9 months ago
LAMDASZ-ML / Self-Backtracking
☆51 Updated 9 months ago
GuanghaoYe / Emergence-of-Thinking
☆53 Updated 9 months ago
NineAbyss / S2R
This is the official implementation of the paper "S²R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning"
☆72 Updated 7 months ago
da03 / Internalize_CoT_Step_by_Step
☆199 Updated 7 months ago
ryoungj / BoLT
Code for "Reasoning to Learn from Latent Thoughts"
☆122 Updated 8 months ago
google-deepmind / bbeh
☆105 Updated 6 months ago
GAIR-NLP / OctoThinker
Revisiting Mid-training in the Era of Reinforcement Learning Scaling
☆180 Updated 4 months ago