Poster Fri, Apr 24, 2026 • 6:30 AM – 9:00 AM PDT

Pretrain Value, Not Reward: Decoupled Value Policy Optimization

Chenghua Huang ⋅ Lu Wang ⋅ Fangkai Yang ⋅ Pu Zhao ⋅ Qingwei Lin ⋅ Dongmei Zhang ⋅ Saravan Rajmohan


Abstract

In this paper, we explore how directly pretraining a value model simplifies and stabilizes reinforcement learning from human feedback (RLHF). In reinforcement learning, value estimation, as distinct from reward supervision, is the key to policy optimization. The value function predicts the return-to-go of a partial answer, that is, how promising the partial answer is if it were continued to completion. In RLHF, however, the standard pipeline first pretrains a reward model and then learns a value function online, even though no new reward signals are available once preference data is collected. This makes critic learning redundant, as the process of training a reward model and then deriving a value model is informationally equivalent to directly pretraining a value model. Importantly, this requires no additional supervision: our value model is trained on exactly the same data used for reward modeling. Building on this insight, we introduce Decoupled Value Policy Optimization (DVPO), a framework that pretrains a Global Value Model (GVM) offline and freezes it as a universal critic for policy learning. The GVM provides stable, fine-grained credit assignment without critic drift or trajectory sampling. Experiments across MT-Bench, Alpaca-Eval, and Arena-Hard demonstrate that DVPO matches or surpasses state-of-the-art RLHF methods. These results highlight that RLHF can be reframed as policy-only optimization guided by a single pretrained value model. The implementation code for our method is available at https://github.com/microsoft/DKI_LLM/tree/main/dvpo.
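
To make the notion of return-to-go concrete, the following is a minimal sketch of the textbook definition, G_t = r_t + gamma * G_(t+1), applied to a response that receives a single score at its final token. It is not taken from the DVPO code. The function name `returns_to_go`, the discount `gamma`, and the toy reward list are assumptions made for illustration only; the abstract does not state how the GVM's training targets are constructed.

```python
from typing import List


def returns_to_go(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Return G_t = sum over k >= t of gamma**(k - t) * rewards[k], for every t.

    Illustrative helper (hypothetical, not part of the DVPO repository).
    """
    out = [0.0] * len(rewards)
    running = 0.0
    # Sweep backwards so each position reuses the return of the next one.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


# Toy 4-token response whose only reward arrives at the last token (assumed data).
sparse_rewards = [0.0, 0.0, 0.0, 1.0]

undiscounted = returns_to_go(sparse_rewards)             # gamma = 1.0
discounted = returns_to_go(sparse_rewards, gamma=0.5)

print(undiscounted)  # [1.0, 1.0, 1.0, 1.0]
print(discounted)    # [0.125, 0.25, 0.5, 1.0]
```

With gamma = 1, every prefix of this single response carries the same return. A value model becomes informative at the token level because it averages such returns over the many possible continuations of each prefix.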
