☆116Jan 21, 2025Updated last year
Alternatives and similar repositories for OREO
Users that are interested in OREO are comparing it to the libraries listed below
- The official implementation of Preference Data Reward-Augmentation.☆18May 1, 2025Updated 10 months ago
- B-STAR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners☆86May 21, 2025Updated 10 months ago
- [AAAI 2026] Official codebase for "GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning".☆94Nov 8, 2025Updated 4 months ago
- ☆64Jan 12, 2026Updated 2 months ago
- The official implementation of Self-Exploring Language Models (SELM)☆63Jun 4, 2024Updated last year
- Large language models (LLMs) made easy, EasyLM is a one stop solution for pre-training, finetuning, evaluating and serving LLMs in JAX/Fl…☆78Aug 17, 2024Updated last year
- A benchmark for evaluating reinforcement learning algorithms that train the policies using imaginary rollouts from LLMs.☆14Nov 4, 2025Updated 4 months ago
- Author implementation of Monte Carlo Augmented Actor Critic in PyTorch☆18Oct 24, 2022Updated 3 years ago
- This is the official repository for "Safer-Instruct: Aligning Language Models with Automated Preference Data"☆17Feb 22, 2024Updated 2 years ago
- This is an official implementation of the paper "Building Math Agents with Multi-Turn Iterative Preference Learning" with multi-turn DP…☆32Dec 5, 2024Updated last year
- Emergent Hierarchical Reasoning in LLMs/VLMs through Reinforcement Learning☆62Oct 24, 2025Updated 4 months ago
- Scalable RL solution for advanced reasoning of language models☆1,821Mar 18, 2025Updated last year
- Reference implementation for Token-level Direct Preference Optimization (TDPO)☆152Feb 14, 2025Updated last year
- RENT (Reinforcement Learning via Entropy Minimization) is an unsupervised method for training reasoning LLMs.☆43Oct 31, 2025Updated 4 months ago
- ☆77Jun 28, 2025Updated 8 months ago
- [ACL 2025] A Generalizable and Purely Unsupervised Self-Training Framework☆71Jun 1, 2025Updated 9 months ago
- UFT: Unifying Supervised and Reinforcement Fine-Tuning☆27Jun 30, 2025Updated 8 months ago
- ☆68Jul 8, 2025Updated 8 months ago
- Benchmark and research code for the paper SWEET-RL Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks☆264May 5, 2025Updated 10 months ago
- Official repository for ACL 2025 paper "ProcessBench: Identifying Process Errors in Mathematical Reasoning"☆187May 20, 2025Updated 10 months ago
- ☆20Dec 14, 2024Updated last year
- Official implementation of ECCV24 paper: POA☆24Aug 8, 2024Updated last year
- [ACL 2025] Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems☆125Jun 11, 2025Updated 9 months ago
- ☆66Feb 4, 2026Updated last month
- Official code for ACT: Empowering Decision Transformer with Dynamic Programming via Advantage Conditioning (AAAI'24)☆17Feb 10, 2024Updated 2 years ago
- Code and models for EMNLP 2024 paper "WPO: Enhancing RLHF with Weighted Preference Optimization"☆41Sep 24, 2024Updated last year
- Train transformer language models with reinforcement learning.☆19Feb 25, 2025Updated last year
- The rule-based evaluation subset and code implementation of Omni-MATH☆27Dec 23, 2024Updated last year
- [NeurIPS 2023 Spotlight] Temperature Balancing, Layer-wise Weight Analysis, and Neural Network Training☆36Apr 7, 2025Updated 11 months ago
- ICML 2024 - Official Repository for EXO: Towards Efficient Exact Optimization of Language Model Alignment☆56Jun 16, 2024Updated last year
- ☆79Nov 19, 2024Updated last year
- Implementation of the ICML 2024 paper "Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning" pr…☆116Feb 9, 2024Updated 2 years ago
- [ACL 2025] "World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning." https://arxiv.org/abs/2503.1…☆17Jul 22, 2025Updated 8 months ago
- Repo of paper "Free Process Rewards without Process Labels"☆170Mar 14, 2025Updated last year
- ☆13May 21, 2024Updated last year
- JudgeLRM: Large Reasoning Models as a Judge☆41Jan 29, 2026Updated last month
- Trust Region Preference Approximation: A simple and stable reinforcement learning algorithm for LLM reasoning☆15Jun 28, 2025Updated 8 months ago
- ☆160Nov 24, 2025Updated 3 months ago
- The original Shared Recurrent Memory Transformer implementation☆34Jul 11, 2025Updated 8 months ago