A Noise is Worth Diffusion Guidance
Abstract
Diffusion models have demonstrated remarkable image generation capabilities, but their performance heavily relies on classifier-free guidance (CFG). While CFG significantly enhances image quality, evaluating both the conditional and unconditional models at every denoising step incurs substantial computational overhead. Existing approaches mitigate this cost through distillation, training a student network to mimic the guided predictions. In contrast, we take an orthogonal approach by refining the \textit{initial Gaussian noise}, a critical yet under-explored factor in diffusion-based generation pipelines. Recent studies have explored noise optimization for specific tasks such as layout-conditioned generation and human preference alignment. However, whether refined noise alone can enable guidance-free high-quality image generation remains an open question. We introduce a noise refinement framework in which a refining network is trained to minimize the difference between images generated by unguided sampling from the refined noise and those produced by guided sampling from the input Gaussian noise. Our method achieves CFG-like quality without modifying the diffusion model, preserving its prior knowledge and its compatibility with techniques such as DreamBooth LoRA. Additionally, the learned refining network generalizes across domains without retraining and integrates seamlessly with existing distilled models, further improving sample quality. Beyond these practical benefits, we provide an in-depth analysis of refined noise, offering insights into its role in the denoising process and its interaction with guidance. Our findings suggest that structured noise initialization is key to efficient and high-fidelity image synthesis. Code and weights will be publicly released.
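As a rough formalization of the objective stated above (a sketch only; the symbols $R_\phi$, $\mathrm{Sample}_\omega$, $\omega$, and $d$ are our own notational assumptions, not taken from the paper), the refining network $R_\phi$ is trained so that unguided sampling from the refined noise matches guided sampling from the original noise:
\[
\min_{\phi}\; \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I),\, c}\!\left[\, d\!\left( \mathrm{Sample}_{\omega=1}\!\big(R_\phi(\epsilon, c),\, c\big),\; \mathrm{Sample}_{\omega>1}(\epsilon, c) \right) \right],
\]
where $c$ is the conditioning signal, $\omega$ is the CFG guidance scale ($\omega=1$ denoting unguided sampling), and $d(\cdot,\cdot)$ is an image-space distance.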