raghavc / LLM-RLHF-Tuning-with-PPO-and-DPO
Comprehensive toolkit for Reinforcement Learning from Human Feedback (RLHF) training, featuring instruction fine-tuning, reward model training, and support for PPO and DPO algorithms with various configurations for the Alpaca, LLaMA, and LLaMA2 models.
☆146 · Updated last year
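As a rough illustration of the DPO stage this toolkit supports, here is a minimal sketch of the DPO objective (illustrative only; the function and argument names are hypothetical, not this repository's API):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss over a batch of preference pairs.

    Each argument is a 1-D tensor of summed token log-probabilities for the
    chosen or rejected completion under the trainable policy or the frozen
    reference model; beta scales the implicit KL penalty.
    """
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    # -log(sigmoid(beta * margin)), computed stably via logsigmoid
    return -F.logsigmoid(beta * (pi_logratios - ref_logratios)).mean()
```

Unlike PPO, this objective needs no separately trained reward model or online sampling loop, which is why toolkits commonly offer both paths.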
Alternatives and similar repositories for LLM-RLHF-Tuning-with-PPO-and-DPO:
Users interested in LLM-RLHF-Tuning-with-PPO-and-DPO are comparing it to the libraries listed below.
- OAT: A research-friendly framework for LLM online alignment, including preference learning, reinforcement learning, etc. ☆283 · Updated this week
- ☆103 · Updated 2 months ago
- Benchmark and research code for the paper "SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks" ☆83 · Updated last week
- ☆141 · Updated 10 months ago
- Code for NeurIPS'24 paper 'Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization' ☆186 · Updated 3 months ago
- L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning ☆148 · Updated last week
- nanoGRPO is a lightweight implementation of Group Relative Policy Optimization (GRPO); a minimal advantage sketch follows this list ☆83 · Updated this week
- A simplified implementation for experimenting with Reinforcement Learning (RL) on GSM8K, inspired by RLVR and DeepSeek R1. This repositor… ☆72 · Updated last month
- ☆111 · Updated last month
- Code and data for "Lumos: Learning Agents with Unified Data, Modular Design, and Open-Source LLMs" ☆462 · Updated last year
- (ICML 2024) AlphaZero-like Tree-Search can guide large language model decoding and training ☆260 · Updated 10 months ago
- Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, spars… (a toy sketch follows this list) ☆310 · Updated 3 months ago
- Minimal hackable GRPO implementation ☆187 · Updated last month
- Train your own SOTA deductive reasoning model ☆81 · Updated 3 weeks ago
- Self-playing Adversarial Language Game Enhances LLM Reasoning, NeurIPS 2024 ☆124 · Updated last month
- Work by the Oxen.ai community attempting to reproduce the Self-Rewarding Language Model paper from Meta AI. ☆125 · Updated 4 months ago
- Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems ☆75 · Updated 3 weeks ago
- ☆260 · Updated last week
- RewardBench: the first evaluation tool for reward models. ☆532 · Updated last month
- ☆307 · Updated 9 months ago
- An extension of the nanoGPT repository for training small MoE models. ☆106 · Updated 2 weeks ago
- Official repository for ORPO ☆445 · Updated 9 months ago
- OpenCoconut implements a latent reasoning paradigm where we generate thoughts before decoding. ☆168 · Updated 2 months ago
- Controlled Text Generation via Language Model Arithmetic ☆216 · Updated 6 months ago
- Research code for the preprint "Optimizing Test-Time Compute via Meta Reinforcement Finetuning". ☆74 · Updated 2 weeks ago
- Positional Skip-wise Training for Efficient Context Window Extension of LLMs to Extremely Long Lengths (ICLR 2024) ☆205 · Updated 10 months ago
- Code repo for "Agent Instructs Large Language Models to be General Zero-Shot Reasoners" ☆104 · Updated 6 months ago
- Parameter-Efficient Sparsity Crafting From Dense to Mixture-of-Experts for Instruction Tuning on General Tasks ☆141 · Updated 6 months ago
- ☆160 · Updated 2 weeks ago
- Official repo for "Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale" ☆229 · Updated last month
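For the nanoGRPO entry above, a minimal sketch of the group-relative advantage computation that gives GRPO its name (shapes and names here are illustrative, not nanoGRPO's actual API):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """GRPO replaces PPO's learned value baseline with group statistics:
    sample several completions per prompt, then standardize each
    completion's reward within its own group.

    rewards: tensor of shape (num_prompts, group_size)
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)
```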
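And for the memory-layers entry, a toy version of the trainable key-value lookup (a naive sketch that scores every slot, whereas real memory-layer implementations use product-key decompositions precisely to avoid that cost; all names here are hypothetical):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMemoryLayer(nn.Module):
    """Trainable key-value memory: parameter count grows with num_slots,
    but only the top-k retrieved values contribute to each output."""

    def __init__(self, dim: int, num_slots: int = 4096, topk: int = 4):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        self.values = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        self.topk = topk

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim). Score all slots (the naive part), keep top-k.
        scores = x @ self.keys.t()                           # (batch, num_slots)
        top_scores, top_idx = scores.topk(self.topk, dim=-1)
        weights = F.softmax(top_scores, dim=-1)              # (batch, topk)
        retrieved = self.values[top_idx]                     # (batch, topk, dim)
        return (weights.unsqueeze(-1) * retrieved).sum(dim=1)
```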