raghavc / LLM-RLHF-Tuning-with-PPO-and-DPO

Comprehensive toolkit for Reinforcement Learning from Human Feedback (RLHF) training, featuring instruction fine-tuning, reward model training, and support for PPO and DPO algorithms with various configurations for the Alpaca, LLaMA, and LLaMA2 models.
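To make the DPO part of the description concrete, here is a minimal sketch of the Direct Preference Optimization loss the toolkit refers to, written against plain PyTorch. It assumes per-sequence log-probabilities have already been computed for chosen and rejected responses under both the policy and a frozen reference model; the function and argument names (dpo_loss, policy_chosen_logps, beta, etc.) are illustrative and not taken from this repository's code.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

    Each argument is a 1-D tensor of summed token log-probabilities per sequence.
    """
    # Log-ratios of policy vs. reference model for each response
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Implicit reward margin scaled by beta, pushed through a logistic loss
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()

In this formulation the reference model only contributes fixed log-probabilities, which is why DPO avoids the separate reward model and PPO rollout loop that the RLHF pipeline above otherwise requires.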

Related projects

Alternatives and complementary repositories for LLM-RLHF-Tuning-with-PPO-and-DPO