princeton-nlp / unintentional-unalignment
[ICLR 2025] Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization
☆32 · Jan 7, 2026 · Updated last month
Alternatives and similar repositories for unintentional-unalignment
Users who are interested in unintentional-unalignment are comparing it to the repositories listed below.
- Demo code of ACMMM 2022 "Quality Assessment of Image Super-Resolution: Balancing Deterministic and Statistical Fidelity" ☆14 · Oct 13, 2022 · Updated 3 years ago
- ☆15 · Mar 30, 2025 · Updated 10 months ago
- [NeurIPS 2025] What Makes a Reward Model a Good Teacher? An Optimization Perspective ☆42 · Sep 18, 2025 · Updated 4 months ago
- ☆17 · Nov 30, 2022 · Updated 3 years ago
- Code for the paper "REV: Information-Theoretic Evaluation of Free-Text Rationales" ☆16 · Aug 11, 2023 · Updated 2 years ago
- Pytorch implementation of the paper 'Compositional language emerge in a neural iterated learning' (ICLR 2020). ☆16 · Oct 14, 2021 · Updated 4 years ago
- Code and data from the paper 'Human Feedback is not Gold Standard' ☆20 · Jul 9, 2024 · Updated last year
- Code for the ICML 2024 paper "Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment" ☆79 · Jun 10, 2025 · Updated 8 months ago
- Deep Weighted Averaging Classifiers ☆23 · Feb 4, 2019 · Updated 7 years ago
- Code to accompany the paper "The Information Geometry of Unsupervised Reinforcement Learning" ☆20 · Oct 6, 2021 · Updated 4 years ago
- ☆20 · Nov 4, 2025 · Updated 3 months ago
- Official code for "Decoding-Time Language Model Alignment with Multiple Objectives". ☆29 · Oct 30, 2024 · Updated last year
- Code for paper "Leakage-Adjusted Simulatability: Can Models Generate Non-Trivial Explanations of Their Behavior in Natural Language?" ☆22 · Oct 13, 2020 · Updated 5 years ago
- ☆20 · Dec 16, 2020 · Updated 5 years ago
- ☆26 · Nov 8, 2022 · Updated 3 years ago
- [NeurIPS 2025] Reinforcement Learning for Reasoning in Large Language Models with One Training Example ☆408 · Nov 21, 2025 · Updated 2 months ago
- AI4Science: Efficient data-driven Online Model Learning (OML) / system identification and control ☆33 · Oct 20, 2022 · Updated 3 years ago
- Pytorch implementation of HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models ☆28 · Mar 22, 2024 · Updated last year
- Code for Fine-grained Uncertainty Quantification for LLMs from Semantic Similarities (NeurIPS'24) ☆35 · Dec 17, 2024 · Updated last year
- ☆28 · Feb 17, 2024 · Updated last year
- ☆29 · Jan 23, 2024 · Updated 2 years ago
- Code for Paper (Preserving Diversity in Supervised Fine-tuning of Large Language Models) ☆51 · May 12, 2025 · Updated 9 months ago
- Learning from preferences is a common paradigm for fine-tuning language models. Yet, many algorithmic design decisions come into play. Ou… ☆32 · Apr 20, 2024 · Updated last year
- ☆34 · May 9, 2025 · Updated 9 months ago
- [ICLR 2026] Do Not Let Low-Probability Tokens Over-Dominate in RL for LLMs ☆41 · May 20, 2025 · Updated 8 months ago
- ☆34 · Nov 21, 2023 · Updated 2 years ago
- Code and data for the paper "Understanding Hidden Context in Preference Learning: Consequences for RLHF" ☆33 · Dec 14, 2023 · Updated 2 years ago
- The official code to build up dataset PMC-OA ☆34 · Jul 16, 2024 · Updated last year
- Official Code for All-in-One Medical Image Re-Identification (CVPR2025) ☆18 · Jan 11, 2026 · Updated last month
- Code for the paper "SMACE: A New Method for the Interpretability of Composite Decision Systems", ECML 2022 ☆15 · Apr 17, 2023 · Updated 2 years ago
- Code for the paper "Distinguishing the Knowable from the Unknowable with Language Models" ☆11 · Apr 15, 2024 · Updated last year
- ☆11 · Mar 11, 2024 · Updated last year
- Code for "Depth Uncertainty in Neural Networks" (https://arxiv.org/abs/2006.08437) ☆78 · Oct 3, 2023 · Updated 2 years ago
- Enhanced Explainable Neural Network ☆10 · Dec 25, 2021 · Updated 4 years ago
- This repo contains the code to reproduce figures in my dissertation "Passive Imaging and Characterization of the Subsurface With Distribu… ☆10 · Jun 14, 2018 · Updated 7 years ago
- [ACL'24, Outstanding Paper] Emulated Disalignment: Safety Alignment for Large Language Models May Backfire! ☆39 · Aug 2, 2024 · Updated last year
- noiseless/nonnegative sparse recovery and feature retrieval via compressed sensing ☆39 · May 8, 2018 · Updated 7 years ago
- ☆34 · Aug 30, 2021 · Updated 4 years ago
- Code for ICML 25 paper "Metadata Conditioning Accelerates Language Model Pre-training (MeCo)" ☆49 · Jun 30, 2025 · Updated 7 months ago