Poster in Workshop: ICLR 2025 Workshop on Bidirectional Human-AI Alignment
OUTLIER-AWARE PREFERENCE OPTIMIZATION FOR LARGE LANGUAGE MODELS
Pragya Srivastava · Sai Nalli · Amit Jayant Deshpande · Amit Sharma
Aligning large language models (LLMs) to user preferences often relies on learning a reward model as a proxy from feedback. However, such reward models can fail on out-of-distribution examples and, if kept static, may reinforce incorrect preferences. We propose a dynamic alignment method that uses an energy-based out-of-distribution (OOD) scoring mechanism to identify potential misjudgments, then judiciously collects oracle feedback to refine both the policy and the reward model. By focusing on OOD examples, our approach iteratively improves alignment and robustness in preference-based training. Empirically, we show that our method enhances the policy model’s generative capabilities on the LM Eval Harness benchmark and improves the reward model’s judgment capability on RewardBench.
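The abstract does not give implementation details, but a standard energy-based OOD score (in the style of Liu et al., 2020) over reward-model logits could serve as the scoring mechanism it describes. The sketch below is illustrative only; the function names, the thresholding rule, and the assumption that the reward model exposes per-response logits are all hypothetical, not taken from the paper.

```python
import torch


def energy_score(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Energy-based OOD score: E(x) = -T * logsumexp(f(x) / T).

    Lower energy suggests in-distribution inputs; higher energy suggests
    out-of-distribution inputs whose reward-model judgment may be unreliable.
    `logits` are assumed to be per-response scores from the reward model.
    """
    return -temperature * torch.logsumexp(logits / temperature, dim=-1)


def select_for_oracle(logits: torch.Tensor, threshold: float) -> torch.Tensor:
    """Hypothetical selection rule: flag high-energy examples for oracle feedback."""
    return energy_score(logits) > threshold  # boolean mask over the batch
```

Under this reading, examples flagged by `select_for_oracle` would be routed to the oracle, and the resulting labels used to update both the policy and the reward model in subsequent training rounds.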