Poster
Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking
Cassidy Laidlaw · Shivam Singhal · Anca Dragan
Hall 3 + Hall 2B #395
Abstract:
Because it is difficult to precisely specify complex objectives, reinforcement learning policies are often optimized using flawed proxy rewards that seem to capture the true objective. However, optimizing proxy rewards frequently leads to reward hacking: the optimized reward function ceases to be a good proxy, and the resulting policy performs poorly with respect to the unspecified true reward. Principled solutions to reward hacking have been impeded by the lack of a good definition of the problem. We introduce a definition of reward hacking based on the correlation between proxy and true rewards for states and actions seen by a "base policy," a correlation that breaks down under optimization. We show that this definition captures reward hacking behavior across several realistic settings, including reinforcement learning from human feedback (RLHF). We then show theoretically that regularization to the base policy can effectively prevent reward hacking. Our theory suggests regularizing the χ² divergence between the policies' occupancy measures, rather than the current practice in RLHF of using a KL penalty between action distributions. We explain intuitively why this type of regularization is better and demonstrate that it outperforms alternatives at mitigating reward hacking in practice across four realistic settings, including RLHF.
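As a rough sketch of the contrast drawn in the abstract (notation assumed, not taken from the paper): write π₀ for the base policy, d_π for the state-action occupancy measure of policy π, R̃ for the proxy reward, and λ for a penalty coefficient. The standard RLHF-style objective penalizes a per-state KL between action distributions,

\[
\max_\pi \; \mathbb{E}_{(s,a) \sim d_\pi}\big[\tilde{R}(s,a)\big] \;-\; \lambda \, \mathbb{E}_{s \sim d_\pi}\Big[ D_{\mathrm{KL}}\big(\pi(\cdot \mid s) \,\big\|\, \pi_0(\cdot \mid s)\big) \Big],
\]

whereas the occupancy-measure alternative suggested by the abstract would penalize divergence between visitation distributions, e.g.

\[
\max_\pi \; \mathbb{E}_{(s,a) \sim d_\pi}\big[\tilde{R}(s,a)\big] \;-\; \lambda \, \chi^2\big(d_\pi \,\big\|\, d_{\pi_0}\big),
\qquad
\chi^2(p \,\|\, q) \;=\; \mathbb{E}_{x \sim q}\!\left[\Big(\tfrac{p(x)}{q(x)} - 1\Big)^{2}\right].
\]

The rough intuition, consistent with the paper's definition: the proxy reward is only known to correlate with the true reward on states and actions the base policy visits, so directly bounding how far d_π can drift from d_{π₀} plausibly keeps the policy in the region where that correlation holds, while a per-state action KL need not bound that drift. These formulas are illustrative only; see the paper for the precise objective and theoretical results.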