Affinity Workshop: Tiny Papers Oral Session 3

Policy Optimization in RLHF: The Impact of Out-of-preference Data

Ziniu Li · Tian Xu · Yang Yu


Aligning agents with human preferences is an important problem. This paper examines two types of alignment methods: Direct Preference Optimization (DPO) and Reward-Model-Based Policy Optimization (RMB-PO); a variant of RMB-PO, referred to as RMB-PO+, is also considered. These methods learn a reward model from preference data, either explicitly or implicitly, and differ in the data used for policy optimization to exploit the reward model's generalization ability. In particular, compared with DPO, RMB-PO additionally uses policy-generated data, and RMB-PO+ further leverages new, preference-free data (i.e., prompts, or so-called states). We examine the impact of such out-of-preference data through synthetic contextual bandit problems. Our study suggests that RMB-PO+ outperforms the other two approaches. In particular, even when the policy model is given a good feature representation, we find that policy optimization with adequate out-of-preference data significantly improves performance by harnessing the reward model's generalization capabilities. We present an analysis based on stochastic approximation and relate our results to other research, including imitation learning and reinforcement learning.
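The pipeline the abstract describes can be sketched in a toy synthetic contextual bandit: fit a reward model from Bradley-Terry-labeled preference pairs, then optimize a softmax policy against the learned reward on a prompt set that may include extra, preference-free prompts. Everything below (the linear features, the roll-based state-action map, all dimensions and hyperparameters) is an illustrative assumption, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_actions = 4, 5                       # feature dim and action count (assumed)
w_true = rng.normal(size=d)               # ground-truth linear reward parameters

def phi(s, a):
    # Illustrative state-action feature: the state vector cyclically shifted by a.
    return np.roll(s, a)

def true_reward(s, a):
    return phi(s, a) @ w_true

def sample_pref_data(n):
    # Preference pairs over random action pairs, labeled by the
    # Bradley-Terry model applied to the true rewards.
    data = []
    for s in rng.normal(size=(n, d)):
        a, b = rng.choice(n_actions, size=2, replace=False)
        p = 1.0 / (1.0 + np.exp(-(true_reward(s, a) - true_reward(s, b))))
        win, lose = (a, b) if rng.random() < p else (b, a)
        data.append((s, win, lose))
    return data

def fit_reward_model(data, lr=0.1, steps=800):
    # Explicit reward modeling: logistic regression on preference pairs.
    w = np.zeros(d)
    for _ in range(steps):
        g = np.zeros(d)
        for s, win, lose in data:
            diff = phi(s, win) - phi(s, lose)
            g += diff * (1.0 - 1.0 / (1.0 + np.exp(-diff @ w)))
        w += lr * g / len(data)
    return w

def train_policy(w_rm, prompts, lr=0.5, steps=200):
    # Softmax policy with linear logits, trained to maximize the *learned*
    # reward on the given prompts. Passing extra preference-free prompts here
    # is the "out-of-preference data" lever the abstract studies.
    theta = np.zeros((n_actions, d))
    for _ in range(steps):
        g = np.zeros_like(theta)
        for s in prompts:
            r_hat = np.array([phi(s, a) @ w_rm for a in range(n_actions)])
            logits = theta @ s
            pi = np.exp(logits - logits.max())
            pi /= pi.sum()
            # Exact gradient of the expected learned reward under the policy.
            g += np.outer(pi * (r_hat - pi @ r_hat), s)
        theta += lr * g / len(prompts)
    return theta

def true_value(theta, n_eval=300):
    # Expected true reward of the softmax policy on fresh evaluation states.
    total = 0.0
    for s in rng.normal(size=(n_eval, d)):
        logits = theta @ s
        pi = np.exp(logits - logits.max())
        pi /= pi.sum()
        total += sum(pi[a] * true_reward(s, a) for a in range(n_actions))
    return total / n_eval

prefs = sample_pref_data(150)
w_rm = fit_reward_model(prefs)
pref_prompts = np.array([s for s, _, _ in prefs])
extra_prompts = rng.normal(size=(550, d))  # new, preference-free prompts
v_pref = true_value(train_policy(w_rm, pref_prompts))
v_extra = true_value(train_policy(w_rm, np.vstack([pref_prompts, extra_prompts])))
```

Comparing `v_pref` against `v_extra` mirrors the question the paper asks; in this low-dimensional linear setting the gap may be small, since a few prompts already pin down the linear policy, whereas the paper's point concerns richer settings where preference prompts cover only a fraction of the state distribution.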
