cassidylaidlaw / orpo
☆19 · Updated last year
Alternatives and similar repositories for orpo
Users interested in orpo are comparing it to the libraries listed below.
- ☆46 · Updated 2 years ago
- Self-Supervised Alignment with Mutual Information ☆21 · Updated last year
- Official implementation of Rewarded Soups ☆62 · Updated 2 years ago
- Code for most of the experiments in the paper "Understanding the Effects of RLHF on LLM Generalisation and Diversity" ☆47 · Updated last year
- Official repo for "Towards Uncertainty-Aware Language Agent" ☆29 · Updated last year
- Preprint: "Asymmetry in Low-Rank Adapters of Foundation Models" ☆35 · Updated last year
- Code for the paper "Policy Optimization in RLHF: The Impact of Out-of-preference Data" ☆28 · Updated last year
- ☆27 · Updated 2 years ago
- ☆46 · Updated last year
- Code for "Reasoning to Learn from Latent Thoughts" ☆122 · Updated 7 months ago
- Code for the paper "Preserving Diversity in Supervised Fine-tuning of Large Language Models" ☆47 · Updated 6 months ago
- Official implementation of "Bootstrapping Language Models via DPO Implicit Rewards" ☆44 · Updated 7 months ago
- Directional Preference Alignment ☆57 · Updated last year
- Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision ☆125 · Updated last year
- Official implementation of the ICLR 2025 paper "Rethinking Bradley-Terry Models in Preference-based Reward Modeling: Foundations, Theory, and…" ☆69 · Updated 7 months ago
- Align your LM to express calibrated verbal statements of confidence in its long-form generations. ☆27 · Updated last year
- A curated list of resources on Reinforcement Learning with Verifiable Rewards (RLVR) and the reasoning capability boundary of Large Langu… ☆76 · Updated last month
- Learning from preferences is a common paradigm for fine-tuning language models. Yet, many algorithmic design decisions come into play. Ou… ☆32 · Updated last year
- [NeurIPS 2025] What Makes a Reward Model a Good Teacher? An Optimization Perspective ☆39 · Updated 2 months ago
- ☆85 · Updated last year
- Domain-specific preference (DSP) data and customized RM fine-tuning. ☆25 · Updated last year
- ☆16 · Updated last year
- ☆20 · Updated 2 weeks ago
- Advantage Leftover Lunch Reinforcement Learning (A-LoL RL): Improving Language Models with Advantage-based Offline Policy Gradients ☆26 · Updated last year
- Let's Sample Step by Step: Adaptive-Consistency for Efficient Reasoning with LLMs ☆39 · Updated last year
- ☆104 · Updated last year
- EMNLP 2024: Model Editing Harms General Abilities of Large Language Models: Regularization to the Rescue ☆37 · Updated 5 months ago
- Code for "Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining" ☆24 · Updated last month
- Source code for the TMLR paper "Black-Box Prompt Learning for Pre-trained Language Models" ☆56 · Updated 2 years ago
- Reproduction of "RLCD: Reinforcement Learning from Contrast Distillation for Language Model Alignment" ☆69 · Updated 2 years ago