Are Easier or Harder Examples Better? Rethinking Data Selection for Reward Models and Preference Optimization
Kevin Christian Wibisono ⋅ Aya Ismail ⋅ Pedro O Pinheiro ⋅ Yixin Wang ⋅ Kyunghyun Cho ⋅ Natasa Tagasovska ⋅ Rajesh Ranganath
Abstract
Despite being crucial for effective LLM alignment, data selection remains understudied. Prior work examining data selection for reward model (RM) training and policy optimization methods (e.g., Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO)) has identified *example difficulty*, measured by the reward gap between chosen and rejected responses, as a key factor. However, findings are contradictory: some studies favor easier examples with larger gaps, while others prefer harder ones with smaller gaps. To isolate the role of difficulty from confounding factors, we *assume access to an oracle RM* and systematically study data selection across RM, DPO, and GRPO training. We find that *training on easier pairs consistently leads to better performance* than training on harder ones, particularly for smaller base models. This advantage persists even when reward estimates are noisy. Notably, using only the top 20\% easiest examples often matches or exceeds full-dataset performance while reducing post-training costs by 5$\times$.
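As a rough illustration of the selection rule described above, the following sketch ranks preference pairs by reward gap (chosen minus rejected) and keeps the easiest fraction, i.e. the pairs with the largest gaps. It is only a sketch under stated assumptions, not the authors' implementation: the names `PreferencePair`, `reward_gap`, and `select_easiest` are hypothetical, and `reward_fn` stands in for whatever oracle or estimated reward model is available.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    """One preference example (hypothetical container, not from the paper)."""
    prompt: str
    chosen: str
    rejected: str


# Assumed interface: reward_fn(prompt, response) -> scalar reward.
RewardFn = Callable[[str, str], float]


def reward_gap(pair: PreferencePair, reward_fn: RewardFn) -> float:
    """Difficulty proxy: reward of the chosen response minus the rejected one.

    A larger gap means an easier pair; a smaller gap means a harder pair.
    """
    return reward_fn(pair.prompt, pair.chosen) - reward_fn(pair.prompt, pair.rejected)


def select_easiest(
    pairs: List[PreferencePair],
    reward_fn: RewardFn,
    fraction: float = 0.2,
) -> List[PreferencePair]:
    """Keep the `fraction` of pairs with the largest reward gaps."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    gaps = [reward_gap(p, reward_fn) for p in pairs]
    # Keep at least one pair whenever the input is non-empty.
    k = max(1, int(len(pairs) * fraction))
    # Sort indices by gap, largest (easiest) first.
    order = sorted(range(len(pairs)), key=lambda i: gaps[i], reverse=True)
    return [pairs[i] for i in order[:k]]
```

With `fraction=0.2`, the selected subset is one fifth of the data, which is where the 5$\times$ reduction in training examples comes from (1 / 0.2 = 5). Selecting the hardest pairs instead would amount to sorting in ascending order of gap.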