Invited Talk by Adam Block: When Best-of-N is Worse: Coverage, Reward Hacking, and Pessimism in Inference-Time Alignment
Abstract
Inference-time scaling has emerged as a powerful way to improve language models, but its simplest form, Best-of-N (BoN) sampling, has a delicate failure mode. While additional samples can substantially improve outputs, in the presence of imperfect verifiers BoN can amplify reward hacking by over-optimizing against the learned reward model. In this talk, we will discuss a perspective on inference-time alignment centered on this tension. I will argue that the performance of BoN is governed by three key ingredients: the coverage of the reference model, the quality of the reward model, and the metric used to measure success. We will see that, under appropriate coverage assumptions, BoN can be optimal for improving expected reward, but that these guarantees do not by themselves rule out severe reward hacking or poor win-rate performance. This helps explain why seemingly contradictory conclusions can all be correct, depending on what objective is being optimized and how inference-time success is measured. Motivated by this, I will then discuss pessimistic alternatives to standard BoN, including methods that use uncertainty estimates at inference time to penalize suspicious high-reward responses. These approaches retain the benefits of additional inference-time compute while mitigating over-optimization to flawed verifiers. Overall, the talk will argue that making language models better at inference time is not just a question of scaling up search, but of understanding when extra compute is pointed in the wrong direction.
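To make the contrast in the abstract concrete, here is a minimal Python sketch of standard BoN selection next to one possible pessimistic variant. It is an illustration under stated assumptions, not the method presented in the talk: the function names, the toy candidates, the reward and uncertainty values, and the specific rule "learned reward minus penalty times uncertainty" are all hypothetical choices made for this example.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], reward: Callable[[str], float]) -> str:
    """Standard BoN: return the candidate with the highest learned reward."""
    return max(candidates, key=reward)


def pessimistic_best_of_n(
    candidates: Sequence[str],
    reward: Callable[[str], float],
    uncertainty: Callable[[str], float],
    penalty: float = 1.0,
) -> str:
    """Illustrative pessimistic variant (an assumption, not the talk's method):
    rank candidates by learned reward minus a multiple of an uncertainty estimate."""
    return max(candidates, key=lambda y: reward(y) - penalty * uncertainty(y))


# Hypothetical toy data: N = 3 responses drawn from a reference model.
candidates = ["honest answer", "padded answer", "exploit"]

# Hypothetical learned (imperfect) reward: the exploit scores highest.
proxy_reward = {"honest answer": 0.7, "padded answer": 0.8, "exploit": 0.95}

# Hypothetical uncertainty estimates: the reward model is least reliable on the exploit.
uncertainty = {"honest answer": 0.05, "padded answer": 0.2, "exploit": 0.6}

bon_choice = best_of_n(candidates, proxy_reward.get)
pessimistic_choice = pessimistic_best_of_n(
    candidates, proxy_reward.get, uncertainty.get, penalty=1.0
)

print("BoN picks:", bon_choice)
print("Pessimistic BoN picks:", pessimistic_choice)
```

In this toy setting, plain BoN selects the response the learned reward likes most, even though the reward model is least certain about it, while the penalized rule prefers a response whose score is better supported. Setting `penalty=0.0` recovers standard BoN.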