Reward Alignment for Flow and Diffusion Models: Fine-Tuning, Trajectory Control, and a Unifying Perspective
Abstract
Recent progress in flow and diffusion models has opened multiple routes to reward alignment, but the landscape is fragmented along two axes: inference-time control versus training-time fine-tuning, and first-order (gradient-based) versus zeroth-order (gradient-free) reward signals. In this talk, I will present a unified view through three recent works from our group. First, I will present a first-order, inference-time alignment method for training-free, reward-guided image editing, which formulates editing as a trajectory optimal control problem and optimizes the reverse trajectory via iterative adjoint-state updates. Second, I will present PCPO, a zeroth-order policy-gradient method that identifies disproportionate credit assignment across timesteps as a key source of instability, and improves convergence and sample quality through a stable reformulation and proportionate reweighting. Third, I will discuss Reward Score Matching (RSM), which shows that many reward-based fine-tuning methods for flow-based models can be understood as matching a reward-guided score, clarifying the roles of value-guidance estimation, temporal weighting, and trust-region design. Together, these works suggest a common perspective on reward alignment as reward-guided control of generative trajectories, spanning both model updates and test-time trajectory optimization.