Decoupling Tilting from Transport: Stable Online Alignment of Flow and Diffusion Policies
Chubin Zhang ⋅ Zhenglin Wan ⋅ Feng Chen ⋅ Fuchao Yang ⋅ Lang Feng ⋅ Yaxin Zhou ⋅ Xingrui Yu ⋅ Yang You ⋅ Ivor Tsang ⋅ Bo An
Abstract
Expressive generative models, such as diffusion and flow-matching models, have shown great promise in representing multimodal distributions for continuous control. However, aligning these models with dynamic reward signals via online reinforcement learning (RL) remains a formidable challenge, primarily due to intractable likelihoods and the instability of propagating gradients through long sampling chains. In this work, we introduce GoRL (Generative Online Reinforcement Learning), a framework that achieves stable *reward-guided* alignment by structurally decoupling optimization from generation. We view online improvement as *reward-guided distribution tilting*, and realize it by decoupling *tilting from transport*: GoRL confines the alignment process to a tractable latent space, effectively learning a tractable steering policy, while delegating complex action synthesis to a conditional generative decoder. Crucially, unlike methods that steer fixed backbones, GoRL *co-evolves* the tilting and transport mechanisms on two timescales. We employ a **prior-anchored refinement** strategy that prevents collapse by forcing the transport map to progressively expand its support to cover the high-reward modes discovered by the latent policy. Empirically, GoRL demonstrates superior stability and performance in aligning flow- and diffusion-based policies, achieving episodic returns more than $3\times$ those of strong baselines on challenging tasks such as HopperStand.
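As a hedged illustration of the two ideas named in the abstract, the sketch below uses notation that is assumed for exposition and is not taken from the paper. A standard way to formalize reward-guided tilting is KL-regularized return maximization, whose solution is an exponentially tilted version of a reference policy. Here $\pi_{\mathrm{ref}}$ (reference policy), $Q(s,a)$ (action value), and $\beta > 0$ (temperature) are hypothetical symbols:

$$
\pi^{*}(\cdot \mid s) \;=\; \arg\max_{\pi}\; \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[Q(s,a)\big] \;-\; \beta\, \mathrm{KL}\big(\pi(\cdot \mid s)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid s)\big),
\qquad
\pi^{*}(a \mid s) \;=\; \frac{\pi_{\mathrm{ref}}(a \mid s)\,\exp\!\big(Q(s,a)/\beta\big)}{\int \pi_{\mathrm{ref}}(a' \mid s)\,\exp\!\big(Q(s,a')/\beta\big)\,da'} .
$$

Under the same assumed notation, decoupling tilting from transport can be read as sampling in two stages, with a latent policy $\pi_\theta$ that has a closed-form density (for example, a Gaussian) and a conditional flow or diffusion decoder $g_\phi$:

$$
z \sim \pi_\theta(z \mid s), \qquad a = g_\phi(s, z).
$$

In this reading, a policy-gradient update of $\theta$ needs only $\log \pi_\theta(z \mid s)$, so neither the action likelihood nor gradients through the decoder's sampling chain are required. How GoRL actually defines the tilting objective and the two-timescale updates is not specified in the abstract.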