hkproj / rlhf-ppo
Notes and commented code for RLHF (PPO)
☆90 · Updated last year
Alternatives and similar repositories for rlhf-ppo:
Users who are interested in rlhf-ppo are comparing it to the repositories listed below.
- Direct Preference Optimization from scratch in PyTorch ☆91 · Updated last month
- ☆122 · Updated 10 months ago
- A continually updated list of literature on Reinforcement Learning from AI Feedback (RLAIF) ☆163 · Updated 3 months ago
- ☆287 · Updated last month
- A brief and partial summary of RLHF algorithms. ☆128 · Updated 2 months ago
- ☆138 · Updated 5 months ago
- A simple toolkit for benchmarking LLMs on mathematical reasoning tasks. 🧮✨ ☆209 · Updated last year
- Minimal hackable GRPO implementation ☆217 · Updated 3 months ago
- A highly capable 2.4B lightweight LLM using only 1T pre-training data with all details. ☆176 · Updated 3 weeks ago
- ☆153 · Updated last month
- A simplified implementation for experimenting with Reinforcement Learning (RL) on GSM8K, inspired by RLVR and Deepseek R1. This repositor… ☆84 · Updated 3 months ago
- ☆192 · Updated 2 months ago
- Curation of resources for LLM mathematical reasoning, most of which are screened by @tongyx361 to ensure high quality and accompanied wit… ☆123 · Updated 9 months ago
- ☆109 · Updated 3 months ago
- nanoGRPO is a lightweight implementation of Group Relative Policy Optimization (GRPO) ☆103 · Updated 3 weeks ago
- Code for STaR: Bootstrapping Reasoning With Reasoning (NeurIPS 2022) ☆205 · Updated 2 years ago
- Research Code for "ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL" ☆167 · Updated 3 weeks ago
- Official repo for paper: "Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't" ☆220 · Updated last month
- This is a repo for showcasing using MCTS with LLMs to solve gsm8k problems ☆75 · Updated last month
- minimal GRPO implementation from scratch ☆87 · Updated last month
- ☆85 · Updated 7 months ago
- Codes and Data for Scaling Relationship on Learning Mathematical Reasoning with Large Language Models ☆261 · Updated 7 months ago
- Reference implementation for Token-level Direct Preference Optimization (TDPO) ☆138 · Updated 2 months ago
- MPO: Boosting LLM Agents with Meta Plan Optimization ☆50 · Updated 2 months ago
- Code for "Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate" ☆141 · Updated 2 weeks ago
- Implementation of paper Data Engineering for Scaling Language Models to 128K Context ☆459 · Updated last year
- ☆63 · Updated 5 months ago
- [NeurIPS 2024] Agent Planning with World Knowledge Model ☆131 · Updated 4 months ago
- ☆95 · Updated last month
- This is the repository that contains the source code for the Self-Evaluation Guided MCTS for online DPO. ☆306 · Updated 9 months ago