

Poster

Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning

Amrith Setlur · Chirag Nagpal · Adam Fisch · Xinyang Geng · Jacob Eisenstein · Rishabh Agarwal · Alekh Agarwal · Jonathan Berant · Aviral Kumar

Hall 3 + Hall 2B #548
Thu 24 Apr midnight PDT — 2:30 a.m. PDT

Abstract:

A promising approach for improving reasoning in large language models is to use process reward models (PRMs). PRMs provide feedback at each step of a multi-step reasoning trace, improving credit assignment over outcome reward models (ORMs) that only provide feedback at the final step. However, collecting dense, per-step human labels is not scalable, and training PRMs from automatically labeled data has thus far led to limited gains. With the goal of using PRMs to improve a base policy via test-time search and reinforcement learning (RL), we ask: "How should we design process rewards?" Our key insight is that, to be effective, the process reward for a step should measure progress: a change in the likelihood of producing a correct response in the future, before and after taking the step, as measured under a prover policy distinct from the base policy. Such progress values can distinguish good and bad steps generated by the base policy, even though the base policy itself cannot. Theoretically, we show that even weaker provers can improve the base policy, as long as they distinguish steps without being too misaligned with the base policy. Our results show that process rewards defined as progress under such provers improve the efficiency of exploration during test-time search and online RL. We empirically validate our claims by training process advantage verifiers (PAVs) to measure progress under such provers, and show that compared to ORMs, they are >8% more accurate and 1.5-5x more compute-efficient. Equipped with these insights, our PAVs enable one of the first results showing a 6x gain in sample efficiency for a policy trained using online RL with PRMs vs. ORMs.
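To make the notion of progress concrete, the following minimal Python sketch (not the authors' implementation) estimates it with Monte Carlo rollouts: the reward for a step is the prover policy's estimated probability of eventually producing a correct answer after the step, minus that probability before the step. The function names (sample_completion, is_correct) and the rollout budget are hypothetical placeholders.

from typing import Callable, List


def success_prob(
    prefix: List[str],
    sample_completion: Callable[[List[str]], List[str]],
    is_correct: Callable[[List[str]], bool],
    num_rollouts: int = 16,
) -> float:
    """Monte Carlo estimate of the prover policy's probability of completing
    the partial reasoning trace `prefix` with a correct final answer."""
    wins = 0
    for _ in range(num_rollouts):
        completion = sample_completion(prefix)   # prover continues the trace
        if is_correct(prefix + completion):      # check the final answer
            wins += 1
    return wins / num_rollouts


def progress_reward(
    prefix: List[str],
    step: str,
    sample_completion: Callable[[List[str]], List[str]],
    is_correct: Callable[[List[str]], bool],
    num_rollouts: int = 16,
) -> float:
    """Progress of `step`: the prover's success probability after taking the
    step minus its success probability before it (the prover's advantage)."""
    before = success_prob(prefix, sample_completion, is_correct, num_rollouts)
    after = success_prob(prefix + [step], sample_completion, is_correct, num_rollouts)
    return after - before

In this sketch the prover is simply whatever policy sample_completion rolls out; the abstract's key point is that this prover is distinct from the base policy whose steps are being scored.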
