Poster
Rethinking Reward Modeling in Preference-based Large Language Model Alignment
Hao Sun · Yunyi Shen · Jean-Francois Ton
Hall 3 + Hall 2B #270
[ Project Page ]
Oral presentation: Oral Session 4A
Fri 25 Apr 12:30 a.m. PDT — 2 a.m. PDT
Sat 26 Apr midnight PDT — 2:30 a.m. PDT
Abstract:
The Bradley-Terry (BT) model is a common and successful practice in reward modeling for Large Language Model (LLM) alignment. However, it remains unclear *why* this model --- originally developed for multi-player stochastic game matching --- can be adopted to convert pairwise response comparisons into reward values and make predictions, especially given that only a limited number of prompt-response pairs are sparsely compared with others. In this paper, we first establish the convergence rate of BT reward models based on deep neural networks using embeddings, providing a theoretical foundation for their use. While theoretically sound, we argue that the BT model is not a necessary choice from the perspective of downstream optimization, because a reward model only needs to preserve correct ranking predictions through a monotonic transformation of the true reward. We highlight the critical concept of *order consistency* in reward modeling and demonstrate that the BT model possesses this property. Moreover, we propose a simple and straightforward upper-bound algorithm, compatible with off-the-shelf binary classifiers, as an alternative order-consistent reward modeling objective. To offer practical insights, we empirically evaluate the performance of these different reward modeling approaches across more than 12,000 experimental setups, using 6 base LLMs, 2 datasets, and diverse annotation designs that vary in quantity, quality, and pairing choices in preference annotations.
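To make the contrast concrete, the sketch below compares the BT pairwise objective, which maximizes log sigma(r(x, y_chosen) - r(x, y_rejected)), with a binary-classification objective that only asks the reward head to score preferred responses above rejected ones. This is a minimal illustration assuming PyTorch and reward heads on top of fixed prompt-response embeddings; the names (RewardHead, bt_loss, classification_loss) and the architecture are hypothetical and not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): Bradley-Terry pairwise loss
# vs. a binary-classification loss over prompt-response embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardHead(nn.Module):
    """Maps a prompt-response embedding to a scalar reward (hypothetical head)."""

    def __init__(self, embed_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(embed_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(emb).squeeze(-1)


def bt_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry objective: maximize log sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(r_chosen - r_rejected).mean()


def classification_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Classification-style alternative: label chosen responses 1 and rejected
    # responses 0, and fit an off-the-shelf binary classifier on the scores.
    logits = torch.cat([r_chosen, r_rejected])
    labels = torch.cat([torch.ones_like(r_chosen), torch.zeros_like(r_rejected)])
    return F.binary_cross_entropy_with_logits(logits, labels)


# Toy usage with random embeddings standing in for LLM features.
head = RewardHead(embed_dim=64)
emb_chosen, emb_rejected = torch.randn(8, 64), torch.randn(8, 64)
print(bt_loss(head(emb_chosen), head(emb_rejected)).item())
print(classification_loss(head(emb_chosen), head(emb_rejected)).item())
```

Both objectives are order consistent in the sense used above: any reward head that ranks chosen responses above rejected ones achieves low loss under either, even though the two imply different monotonic transformations of the underlying reward.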