princeton-nlp / unintentional-unalignment

[ICLR 2025] Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization

☆29

Alternatives and similar repositories for unintentional-unalignment

Users who are interested in unintentional-unalignment are comparing it to the repositories listed below.

deeplearning-wisc / args
☆40 Updated last year
abhishekpanigrahi1996 / Skill-Localization-by-grafting
☆49 Updated last year
dannyallover / overthinking_the_truth
☆29 Updated last year
haotiansun14 / BBox-Adapter
Lightweight Adapting for Black-Box Large Language Models
☆22 Updated last year
Jiuzhouh / Uncertainty-Aware-Language-Agent
This is the official repo for Towards Uncertainty-Aware Language Agent.
☆25 Updated 10 months ago
sail-sg / dice
Official implementation of Bootstrapping Language Models via DPO Implicit Rewards
☆44 Updated 2 months ago
RLHFlow / Directional-Preference-Alignment
Directional Preference Alignment
☆57 Updated 9 months ago
Edward-Sun / easy-to-hard
Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision
☆121 Updated 9 months ago
yuzhaouoe / SAE-based-representation-engineering
[NAACL'25 Oral] Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering
☆60 Updated 7 months ago
Shentao-YANG / Preference_Grounded_Guidance
Source codes for "Preference-grounded Token-level Guidance for Language Model Fine-tuning" (NeurIPS 2023).
☆16 Updated 5 months ago
ajyl / dpo_toxic
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity.
☆72 Updated 3 months ago
janphilippfranken / sami
Self-Supervised Alignment with Mutual Information
☆19 Updated last year
GAIR-NLP / alignment-for-honesty
☆74 Updated last year
chujiezheng / LLM-MCQ-Bias
Official repository for ICLR 2024 Spotlight paper "Large Language Models Are Not Robust Multiple Choice Selectors"
☆39 Updated last month
SumilerGAO / SunGen
☆27 Updated 2 years ago
activatedgeek / calibration-tuning
☆51 Updated 2 months ago
RUCAIBox / RLMEC
The official repository of "Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint"
☆38 Updated last year
roeehendel / icl_task_vectors
☆95 Updated last year
Hritikbansal / jpo
☆13 Updated 6 months ago
srzer / MOD
Official code for "Decoding-Time Language Model Alignment with Multiple Objectives".
☆24 Updated 7 months ago
RLHFlow / RAFT
This is an official implementation of the Reward rAnked Fine-Tuning Algorithm (RAFT), also known as iterative best-of-n fine-tuning or re…
☆32 Updated 9 months ago
hkust-nlp / Activation_Decoding
In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation (ICML 2024)
☆59 Updated last year
alexrame / rewardedsoups
Rewarded soups official implementation
☆58 Updated last year
RZFan525 / Awesome-ScalingLaws
A curated list of awesome resources dedicated to Scaling Laws for LLMs
☆72 Updated 2 years ago
zepingyu0512 / in-context-mechanism
code for EMNLP 2024 paper: How do Large Language Models Learn In-Context? Query and Key Matrices of In-Context Heads are Two Towers for M…
☆12 Updated 7 months ago
GAIR-NLP / Preference-Dissection
☆25 Updated last year
princeton-pli / what-makes-good-rm
What Makes a Reward Model a Good Teacher? An Optimization Perspective
☆32 Updated 2 months ago
tianjunz / TEMPERA
☆44 Updated 2 years ago
ruiqi-zhong / nlparam
Augmenting Statistical Models with Natural Language Parameters
☆27 Updated 9 months ago
ChicagoHAI / active-example-selection
Active Example Selection for In-Context Learning (EMNLP'22)
☆49 Updated 11 months ago