Understanding Sampler Stochasticity in Training Diffusion Models for RLHF
Abstract
Reinforcement Learning from Human Feedback (RLHF) improves pretrained generative models, and its sampling design is important for training reliable, high-quality models. In practice, stochastic SDE samplers promote exploration during training, while deterministic ODE samplers enable fast, stable inference; this discrepancy in sampling stochasticity between training and inference induces a preference-reward gap. In this paper, we establish a non-vacuous bound on this gap for general diffusion models and a sharper bound for Variance Exploding (VE) and Variance Preserving (VP) models with Gaussian or Gaussian-mixture data. Methodologically, we leverage the stochastic gDDIM scheme to attain arbitrarily high stochasticity while preserving data marginals, and we evaluate the performance of RL algorithms (e.g., log-likelihood and group-relative policy variants) under multiple preference rewards. Our numerical experiments validate that reward gaps consistently narrow over training, and that ODE sampling quality improves when models are updated using higher-stochasticity SDE training.
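To make the idea of varying stochasticity while preserving data marginals concrete, the following is a minimal sketch under simplifying assumptions, not the gDDIM scheme used in the paper. It assumes a one-dimensional VP diffusion with constant noise rate `beta` and Gaussian data N(`mu`, `s`^2), so the score is available in closed form, and it integrates a one-parameter family of reverse-time SDEs with plain Euler-Maruyama. The stochasticity level `eta`, the function names, and all numerical values are hypothetical choices for illustration: `eta = 0` corresponds to the deterministic probability-flow ODE, `eta = 1` to the standard reverse SDE, and larger `eta` to higher stochasticity.

```python
import numpy as np

def marginal(t, mu, s, beta):
    """Mean and variance of the VP forward marginal at time t for N(mu, s^2) data."""
    a = np.exp(-0.5 * beta * t)
    return mu * a, (s * a) ** 2 + 1.0 - a ** 2

def sample(eta, mu=2.0, s=0.5, beta=1.0, T=5.0, n_steps=1000, n=20000, seed=0):
    """Integrate the reverse-time dynamics from t = T down to t = 0.

    Reverse-time family (hypothetical parameterization):
        dx = [f(x) - 0.5 * (1 + eta^2) * g^2 * score(x, t)] dt + eta * g dW,
    with f(x) = -0.5 * beta * x and g^2 = beta. Every eta >= 0 shares the
    same marginals as the forward process; only the sample paths differ.
    """
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    m, v = marginal(T, mu, s, beta)
    x = m + np.sqrt(v) * rng.standard_normal(n)  # start from the exact marginal at T
    for k in range(n_steps):
        t = T - k * dt
        m, v = marginal(t, mu, s, beta)
        score = -(x - m) / v  # closed-form score of a Gaussian marginal
        drift = -0.5 * beta * x - 0.5 * (1.0 + eta ** 2) * beta * score
        noise = eta * np.sqrt(beta * dt) * rng.standard_normal(n)
        x = x - drift * dt + noise  # one Euler-Maruyama step backward in time
    return x

if __name__ == "__main__":
    for eta in (0.0, 1.0, 2.0):
        x = sample(eta)
        print(f"eta={eta}: mean={x.mean():.3f}, std={x.std():.3f}")
```

Up to discretization and Monte Carlo error, each value of `eta` should recover samples with mean near `mu` and standard deviation near `s`, which is the sense in which stochasticity can be raised without changing the marginals in this toy setting.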