Poster in Workshop: ReALM-GEN: Real-World Constrained and Preference-Aligned Flow- and Diffusion-based Generative Models
Mon, Apr 27, 2026 • 7:35 AM – 8:15 AM PDT

Diffusion Policy Optimization without Drifting Apart

Haozhe Jiang ⋅ Haiwen Feng ⋅ Jiantao Jiao ⋅ Angjoo Kanazawa ⋅ Nika Haghtalab


Abstract

RL post-training has become increasingly pivotal for improving diffusion policies, but existing diffusion policy gradient methods are often unstable and fail to deliver reliable policy improvement. We identify the cause as the double drift phenomenon: optimizing a variational surrogate can let the ELBO drift away from the true log-likelihood, which in turn misaligns the resulting proxy policy gradient with the true policy gradient of the expected return. We propose DiPOD, a diffusion policy optimization framework that keeps the bound tight throughout training by interleaving self-distillation with policy-improving gradient updates. This leads to a simple and practical algorithm: augment each diffusion policy-gradient update with an on-policy ELBO regularizer. Across diffusion language model post-training and continuous-control diffusion policies, DiPOD substantially stabilizes training and reaches higher rewards than previous methods.
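
One way to read the double drift argument is sketched below. The notation is assumed here for illustration and is not taken from the paper: $\pi_\theta$ is the diffusion policy, $\mathcal{L}_\theta$ its ELBO, $g_\theta \ge 0$ the variational gap, and $A$ an advantage estimate. Expectations are over states and actions sampled from the current policy, and $\mathrm{sg}(\cdot)$ marks samples treated as fixed when differentiating. The last line relies on the score-function identity $\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s)] = 0$.

```latex
\begin{align}
\log \pi_\theta(a \mid s)
  &= \mathcal{L}_\theta(a \mid s) + g_\theta(a \mid s),
  \qquad g_\theta(a \mid s) \ge 0
  && \text{(ELBO plus gap)} \\
\nabla_\theta J(\theta)
  &= \mathbb{E}\big[ A(s,a)\, \nabla_\theta \log \pi_\theta(a \mid s) \big]
  && \text{(true policy gradient)} \\
\widehat{\nabla_\theta J}(\theta)
  &= \mathbb{E}\big[ A(s,a)\, \nabla_\theta \mathcal{L}_\theta(a \mid s) \big]
   = \nabla_\theta J(\theta)
     - \mathbb{E}\big[ A(s,a)\, \nabla_\theta g_\theta(a \mid s) \big]
  && \text{(proxy gradient)} \\
\nabla_\theta\, \mathbb{E}_{a \sim \mathrm{sg}(\pi_\theta)}
  \big[ \mathcal{L}_\theta(a \mid s) \big]
  &= -\, \mathbb{E}_{a \sim \pi_\theta}
     \big[ \nabla_\theta g_\theta(a \mid s) \big]
  && \text{(on-policy ELBO term)}
\end{align}
```

Under these assumptions, the proxy gradient differs from the true one by the term $\mathbb{E}[A\, \nabla_\theta g_\theta]$, which vanishes only while the gap stays flat in $\theta$. Ascending the on-policy ELBO acts purely to shrink the expected gap, which is consistent with the abstract's description of the regularizer as self-distillation. The exact DiPOD objective may differ from this sketch.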
