
Poster in Workshop: Quantify Uncertainty and Hallucination in Foundation Models: The Next Frontier in Reliable AI

OUTLIER-AWARE PREFERENCE OPTIMIZATION FOR LARGE LANGUAGE MODELS

Pragya Srivastava · Sai Nalli · Amit Jayant Deshpande · Amit Sharma

Keywords: [ Alignment ] [ DPO ] [ Preference Optimization ] [ OOD ]


Abstract:

Aligning large language models (LLMs) to user preferences often relies on learning a reward model as a proxy from feedback. However, such reward models can fail on out-of-distribution examples and, if kept static, may reinforce incorrect preferences. We propose a dynamic alignment method that uses an energy-based out-of-distribution (OOD) scoring mechanism to identify potential misjudgments, then judiciously collects oracle feedback to refine both the policy and reward model. By focusing on the OOD examples, our approach iteratively improves alignment and robustness in preference-based training. Empirically, we show that our method enhances the policy model’s generative capabilities on the MT-Bench benchmark and improves the reward model’s judgment capability on RewardBench.
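To make the scoring step concrete, below is a minimal sketch of an energy-based OOD score, assuming the standard formulation of Liu et al. (2020) applied to the reward model's final-layer logits; the function names, the `temperature` parameter, and the thresholding rule are illustrative assumptions, and the paper's exact scoring mechanism and feedback-routing procedure may differ.

```python
import torch


def energy_ood_score(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Energy score E(x) = -T * logsumexp(f(x) / T) over the reward model's logits.

    Higher (less negative) energy indicates the input is likely out-of-distribution,
    i.e. the reward model's judgment on it is less trustworthy.
    logits: shape (batch, num_outputs).
    """
    return -temperature * torch.logsumexp(logits / temperature, dim=-1)


def flag_ood_examples(logits: torch.Tensor, threshold: float) -> torch.Tensor:
    """Boolean mask of examples whose energy exceeds a chosen threshold.

    In the workflow described in the abstract, flagged examples would be sent to an
    oracle for fresh preference labels, which then refine both the policy and the
    reward model. The threshold here is a hypothetical hyperparameter.
    """
    return energy_ood_score(logits) > threshold
```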
