This is code for most of the experiments in the paper "Understanding the Effects of RLHF on LLM Generalisation and Diversity".
☆47Jan 19, 2024Updated 2 years ago
Alternatives and similar repositories for rlfh-gen-div
Users that are interested in rlfh-gen-div are comparing it to the libraries listed below
- Rewarded soups official implementation☆62Sep 27, 2023Updated 2 years ago
- Directed masked autoencoders☆14Feb 20, 2026Updated 2 weeks ago
- Code for EMNLP'24 paper - On Diversified Preferences of Large Language Model Alignment☆16Aug 6, 2024Updated last year
- ☆21Jun 22, 2025Updated 8 months ago
- Code for Paper (Policy Optimization in RLHF: The Impact of Out-of-preference Data)☆28Dec 19, 2023Updated 2 years ago
- Learning from preferences is a common paradigm for fine-tuning language models. Yet, many algorithmic design decisions come into play. Ou…☆32Apr 20, 2024Updated last year
- Overcooked-AI Experiment Psiturk Demo (for MTurk experiments)☆12May 10, 2021Updated 4 years ago
- Measuring the Signal to Noise Ratio in Language Model Evaluation☆28Aug 19, 2025Updated 6 months ago
- Explore, Establish, Exploit: Red Teaming Language Models from Scratch☆13Jun 21, 2023Updated 2 years ago
- RewardBench: the first evaluation tool for reward models.☆702Feb 16, 2026Updated 3 weeks ago
- JudgeLRM: Large Reasoning Models as a Judge☆41Jan 29, 2026Updated last month
- ☆18Jul 24, 2023Updated 2 years ago
- [NeurIPS 2025] Reasoning Models Better Express Their Confidence☆22Nov 19, 2025Updated 3 months ago
- ☆14Jul 24, 2024Updated last year
- ☆160Nov 23, 2024Updated last year
- Simple notebooks to learn diffusion models on toy datasets☆17Feb 9, 2023Updated 3 years ago
- Recipes to train reward model for RLHF.☆1,518Apr 24, 2025Updated 10 months ago
- Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique☆18Aug 22, 2024Updated last year
- ☆16Mar 22, 2024Updated last year
- ☆19Jul 24, 2023Updated 2 years ago
- [NeurIPS 2024 Oral] Aligner: Efficient Alignment by Learning to Correct☆191Jan 16, 2025Updated last year
- Un-*** 50 billions multimodality dataset☆23Sep 14, 2022Updated 3 years ago
- Paper List for In-context Learning 🌷☆19Jan 3, 2023Updated 3 years ago
- A paper list of self-supervised pretrain method☆22Aug 15, 2025Updated 6 months ago
- ☆26May 30, 2023Updated 2 years ago
- This is code to accompany the paper "Accelerating Exploration with Unlabeled Prior Data".☆25Dec 5, 2023Updated 2 years ago
- Code relative to "Adversarial robustness against multiple and single $l_p$-threat models via quick fine-tuning of robust classifiers"☆19Nov 30, 2022Updated 3 years ago
- [𝐄𝐌𝐍𝐋𝐏 𝐅𝐢𝐧𝐝𝐢𝐧𝐠𝐬 𝟐𝟎𝟐𝟒 & 𝐀𝐂𝐋 𝟐𝟎𝟐𝟒 𝐍𝐋𝐑𝐒𝐄 𝐎𝐫𝐚𝐥] 𝘌𝘯𝘩𝘢𝘯𝘤𝘪𝘯𝘨 𝘔𝘢𝘵𝘩𝘦𝘮𝘢𝘵𝘪𝘤𝘢𝘭 𝘙𝘦𝘢𝘴𝘰𝘯𝘪𝘯…☆51May 4, 2024Updated last year
- ☆25May 16, 2024Updated last year
- VQVAE | VAE | GumbelVAE | PixelCNN☆21Jun 15, 2020Updated 5 years ago
- A minimal implementation of Drifting Models for 2D toy data. Unlike diffusion/flow models that iterate at inference, drifting models evo…☆64Feb 13, 2026Updated 3 weeks ago
- Code for paper "Universal Jailbreak Backdoors from Poisoned Human Feedback"☆66Apr 24, 2024Updated last year
- Data Valuation on In-Context Examples (ACL23)☆24Jan 12, 2025Updated last year
- [ICLR 2026] The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution☆235Feb 28, 2026Updated last week
- A recipe for online RLHF and online iterative DPO.☆542Dec 28, 2024Updated last year
- Code and links for over 25,000 trained Atari agents☆98Aug 22, 2024Updated last year
- Minimal RLHF implementation built on top of minGPT.☆32Jul 4, 2024Updated last year
- This is a new metric that can be used to evaluate faithfulness of text generated by LLMs. The work behind this repository can be found he…☆31Aug 25, 2023Updated 2 years ago
- Parallel Q-Learning: Scaling Off-policy Reinforcement Learning under Massively Parallel Simulation☆76Aug 2, 2023Updated 2 years ago