Diffusion Policy Optimization without Drifting Apart
Abstract
Reinforcement learning (RL) post-training has become increasingly pivotal for improving diffusion policies, but existing diffusion policy gradient methods are often unstable and fail to achieve reliable policy improvement. We identify the cause as the double drift phenomenon: optimizing a variational surrogate can let the evidence lower bound (ELBO) separate from the true log-likelihood, which in turn misaligns the resulting proxy policy gradient with the true policy gradient of the expected return. We propose DiPOD, a diffusion policy optimization framework that keeps the ELBO tight throughout training by interleaving self-distillation with policy-improving gradient updates. This leads to a simple and practical algorithm: augmenting each diffusion policy-gradient update with an on-policy ELBO regularizer. Across diffusion language model post-training and continuous-control diffusion policies, DiPOD substantially stabilizes training and reaches higher rewards than previous methods.
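To make the double drift concrete, the following is a minimal sketch in illustrative notation, not the formal statement of the method. The symbols $\pi_\theta$ (diffusion policy), $q$ (fixed forward noising process), $a_{1:T}$ (noised latents of a clean action or sequence $a_0$), $\widehat{J}$ (the ELBO-based proxy policy-gradient objective), $\lambda$ (regularizer weight), and $\bar\theta$ (a stop-gradient copy of $\theta$) are all assumptions introduced here. The first line is the standard variational identity, the second is one plausible form of the regularized update, and the third shows why, under this reading, an on-policy ELBO term shrinks the gap rather than moving the likelihood to first order.

```latex
\begin{align}
% Standard identity: the ELBO differs from the log-likelihood by a KL gap.
\log \pi_\theta(a_0 \mid s)
  &= \mathrm{ELBO}_\theta(a_0 \mid s)
   + \underbrace{D_{\mathrm{KL}}\!\big(q(a_{1:T} \mid a_0)\,\big\|\,
       \pi_\theta(a_{1:T} \mid a_0, s)\big)}_{\mathrm{Gap}_\theta(a_0 \mid s)\;\ge\;0} \\[4pt]
% Assumed form of the update: proxy objective plus an on-policy ELBO regularizer.
\max_\theta \;\;
  & \widehat{J}(\theta)
   + \lambda\, \mathbb{E}_{a_0 \sim \pi_{\bar\theta}(\cdot \mid s)}
       \big[\mathrm{ELBO}_\theta(a_0 \mid s)\big],
   \qquad \bar\theta = \mathrm{stopgrad}(\theta) \\[4pt]
% On the policy's own samples the score term has zero mean,
% so the regularizer's gradient only reduces the expected gap.
\nabla_\theta\, \mathbb{E}_{a_0 \sim \pi_{\bar\theta}}
   \big[\mathrm{ELBO}_\theta(a_0 \mid s)\big]\Big|_{\theta = \bar\theta}
  &= \underbrace{\mathbb{E}_{a_0 \sim \pi_{\bar\theta}}
       \big[\nabla_\theta \log \pi_\theta(a_0 \mid s)\big]}_{=\,0}
   \;-\; \nabla_\theta\, \mathbb{E}_{a_0 \sim \pi_{\bar\theta}}
       \big[\mathrm{Gap}_\theta(a_0 \mid s)\big]
\end{align}
```

Under these assumptions, the first drift is growth of $\mathrm{Gap}_\theta$ when only $\widehat{J}$ is optimized, and the second drift is the resulting mismatch between $\nabla_\theta \widehat{J}$ and the true policy gradient, which vanishes when the gap is zero along the optimization path.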