hkproj / rlhf-ppo
Notes and commented code for RLHF (PPO)
☆77 Updated last year
Alternatives and similar repositories for rlhf-ppo:
Users interested in rlhf-ppo are comparing it to the repositories listed below.
- Direct Preference Optimization from scratch in PyTorch ☆89 Updated last year
- ☆116 Updated 9 months ago
- A highly capable 2.4B lightweight LLM using only 1T pre-training data with all details. ☆165 Updated last week
- L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning ☆148 Updated last week
- nanoGRPO is a lightweight implementation of Group Relative Policy Optimization (GRPO) ☆83 Updated this week
- ☆103 Updated 2 months ago
- ☆91 Updated 3 months ago
- A brief and partial summary of RLHF algorithms. ☆127 Updated 3 weeks ago
- ☆102 Updated 3 months ago
- ☆260 Updated last week
- ☆166 Updated last month
- Augmented LLM with self-reflection ☆117 Updated last year
- Implementation of the Quiet-STAR paper (https://arxiv.org/pdf/2403.09629.pdf) ☆53 Updated 7 months ago
- [NeurIPS 2024] The official implementation of the paper "Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs". ☆104 Updated last week
- Curation of resources for LLM mathematical reasoning, most of which are screened by @tongyx361 to ensure high quality and accompanied wit… ☆119 Updated 8 months ago
- ☆143 Updated 3 months ago
- This is the code of MMOA-RAG. ☆44 Updated last week
- This is the repository that contains the source code for the Self-Evaluation Guided MCTS for online DPO. ☆299 Updated 7 months ago
- ☆128 Updated last week
- ☆83 Updated 2 weeks ago
- Code for "Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate" ☆131 Updated last month
- ☆136 Updated 4 months ago
- A simple toolkit for benchmarking LLMs on mathematical reasoning tasks. 🧮✨ ☆191 Updated 11 months ago
- Research code for the preprint "Optimizing Test-Time Compute via Meta Reinforcement Finetuning". ☆74 Updated 2 weeks ago
- A continually updated list of literature on Reinforcement Learning from AI Feedback (RLAIF) ☆158 Updated 2 months ago
- ☆60 Updated 4 months ago
- ☆19 Updated 3 months ago
- [NeurIPS 2024] Agent Planning with World Knowledge Model ☆120 Updated 3 months ago
- Repo of the paper "Free Process Rewards without Process Labels" ☆138 Updated 2 weeks ago
- Official repo for "Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale" ☆229 Updated last month