sail-sg / feedback-conditional-policy
Code for "Language Models Can Learn from Verbal Feedback Without Scalar Rewards"
☆47 Updated 3 weeks ago
Alternatives and similar repositories for feedback-conditional-policy
Users who are interested in feedback-conditional-policy compare it to the repositories listed below.
- Optimizing Anytime Reasoning via Budget Relative Policy Optimization ☆47 Updated 3 months ago
- ☆63 Updated 4 months ago
- Reinforcing General Reasoning without Verifiers ☆90 Updated 4 months ago
- ☆44 Updated 3 weeks ago
- ☆50 Updated 8 months ago
- ☆17 Updated 2 months ago
- [ICML 2025] M-STAR (Multimodal Self-Evolving TrAining for Reasoning) Project. Diving into Self-Evolving Training for Multimodal Reasoning ☆69 Updated 3 months ago
- Emergent Hierarchical Reasoning in LLMs/VLMs through Reinforcement Learning ☆40 Updated last month
- Code for "Reasoning to Learn from Latent Thoughts" ☆121 Updated 6 months ago
- Code for Heima ☆56 Updated 6 months ago
- [COLM 2025] Official code for "When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoni… ☆13 Updated last month
- A Recipe for Building LLM Reasoners to Solve Complex Instructions ☆27 Updated 2 weeks ago
- [ICLR 2025] When Attention Sink Emerges in Language Models: An Empirical View (Spotlight) ☆128 Updated 3 months ago
- A Sober Look at Language Model Reasoning ☆85 Updated 2 weeks ago
- RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment ☆16 Updated 10 months ago
- The official code repository for the paper "Mirage or Method? How Model–Task Alignment Induces Divergent RL Conclusions". ☆15 Updated last month
- Reasoning Activation in LLMs via Small Model Transfer (NeurIPS 2025) ☆19 Updated last week
- ☆21 Updated 5 months ago
- Official implementation of Bootstrapping Language Models via DPO Implicit Rewards ☆44 Updated 6 months ago
- Process Reward Models That Think ☆59 Updated last week
- Source code for our paper: "ARIA: Training Language Agents with Intention-Driven Reward Aggregation". ☆23 Updated 2 months ago
- ☆45 Updated 3 weeks ago
- A repo for open research on building large reasoning models ☆107 Updated last week
- The official repository of NeurIPS'25 paper "Ada-R1: From Long-Cot to Hybrid-CoT via Bi-Level Adaptive Reasoning Optimization" ☆19 Updated last month
- Large Language Models Can Self-Improve in Long-context Reasoning ☆73 Updated 11 months ago
- ☆19 Updated 6 months ago
- From Accuracy to Robustness: A Study of Rule- and Model-based Verifiers in Mathematical Reasoning. ☆23 Updated 2 weeks ago
- [ACL'25 Oral] What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective ☆75 Updated 4 months ago
- Official Repository of LatentSeek ☆66 Updated 4 months ago
- [ACL 2025] Are Your LLMs Capable of Stable Reasoning? ☆30 Updated 2 months ago