hkproj / rlhf-ppo
Notes and commented code for RLHF (PPO)
☆85 Updated last year
Alternatives and similar repositories for rlhf-ppo:
Users who are interested in rlhf-ppo are comparing it to the repositories listed below:
- Direct Preference Optimization from scratch in PyTorch ☆90 Updated last week
- ☆84 Updated 6 months ago
- A highly capable 2.4B lightweight LLM using only 1T pre-training data with all details. ☆170 Updated this week
- nanoGRPO is a lightweight implementation of Group Relative Policy Optimization (GRPO) ☆97 Updated this week
- ☆137 Updated 4 months ago
- ☆105 Updated 2 months ago
- Official repo for paper: "Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't" ☆195 Updated 3 weeks ago
- Research Code for preprint "Optimizing Test-Time Compute via Meta Reinforcement Finetuning". ☆90 Updated last month
- ☆118 Updated 9 months ago
- L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning ☆180 Updated 3 weeks ago
- ☆101 Updated 4 months ago
- RLHF implementation details of OAI's 2019 codebase ☆186 Updated last year
- A simplified implementation for experimenting with Reinforcement Learning (RL) on GSM8K, inspired by RLVR and Deepseek R1. This repositor… ☆74 Updated 2 months ago
- Reasoning with Language Model is Planning with World Model ☆163 Updated last year
- ☆278 Updated last month
- Survey of Small Language Models from Penn State, ... ☆171 Updated 3 months ago
- "Improving Mathematical Reasoning with Process Supervision" by OPENAI ☆108 Updated last week
- A simple toolkit for benchmarking LLMs on mathematical reasoning tasks. 🧮✨ ☆199 Updated 11 months ago
- ☆99 Updated 2 weeks ago
- ☆91 Updated last month
- A brief and partial summary of RLHF algorithms. ☆127 Updated last month
- [ACL'24] Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning ☆354 Updated 7 months ago
- Code for STaR: Bootstrapping Reasoning With Reasoning (NeurIPS 2022) ☆203 Updated 2 years ago
- Minimal hackable GRPO implementation ☆206 Updated 2 months ago
- This is the repository that contains the source code for the Self-Evaluation Guided MCTS for online DPO. ☆301 Updated 8 months ago
- Benchmark and research code for the paper SWEET-RL Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks ☆175 Updated this week
- RewardBench: the first evaluation tool for reward models. ☆553 Updated last month
- Curation of resources for LLM mathematical reasoning, most of which are screened by @tongyx361 to ensure high quality and accompanied wit… ☆122 Updated 9 months ago
- A Comprehensive Survey on Long Context Language Modeling ☆129 Updated 3 weeks ago
- ☆255 Updated last year