Why Adversarially Train Diffusion Models?
Abstract
Adversarial Training (AT) is a powerful, well-established technique for improving classifier robustness to input perturbations, yet its applicability beyond discriminative settings remains limited. Motivated by the widespread use of score-based generative models and their need to operate robustly on heavily noisy or corrupted input data, we propose an adaptation of AT for these models and provide a thorough empirical assessment. We introduce a principled formulation of AT for Diffusion Models (DMs) that replaces the conventional invariance objective with an equivariance constraint aligned with the denoising dynamics of score matching. Our method integrates seamlessly into diffusion training by adding either random perturbations, similar to randomized smoothing, or adversarial ones, akin to AT. Our approach offers several advantages: (a) tolerance to heavy noise and corruption, (b) reduced memorization, (c) robustness to outliers and extreme data variability, and (d) resilience to iterative adversarial attacks. We validate these claims on proof-of-concept low- and high-dimensional datasets with known ground-truth distributions, enabling precise error analysis. We further evaluate on standard benchmarks (CIFAR-10, CelebA, and LSUN Bedroom), where our approach improves robustness and preserves sample fidelity under severe noise, data corruption, and adversarial evaluation. Code will be made available upon acceptance.
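To make the training recipe described above concrete, the following is a minimal PyTorch sketch of denoising score matching with an extra perturbation added to the noisy input: a random draw inside an epsilon-ball (smoothing-style) or a few PGD steps (AT-style). The shifted regression target eps + delta / sigma_t is one plausible reading of the "equivariance constraint aligned with the denoising dynamics"; the function name, the epsilon-ball projection, the model signature, and all hyper-parameters are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of adversarially perturbed denoising score matching.
# Everything here (names, schedules, hyper-parameters) is an assumption for
# illustration; it is not the authors' released code.
import torch
import torch.nn.functional as F

def perturbed_dsm_loss(model, x0, alphas_bar, *, eps_ball=0.05, adv_steps=0, adv_lr=0.01):
    """One noise-prediction training step with a perturbation delta on x_t.

    adv_steps == 0 -> random (smoothing-style) perturbation
    adv_steps  > 0 -> PGD-style adversarial perturbation (AT-style)

    The target is shifted by delta / sigma_t so the network is asked to denoise
    the perturbed point back toward the clean sample (an equivariance-style
    target rather than an invariance one -- our assumed interpretation).
    """
    b = x0.shape[0]
    t = torch.randint(0, len(alphas_bar), (b,), device=x0.device)
    a_bar = alphas_bar[t].view(b, *([1] * (x0.dim() - 1)))
    sigma = (1.0 - a_bar).sqrt()

    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + sigma * eps

    # Start from a random perturbation inside the epsilon-ball.
    delta = eps_ball * torch.empty_like(x_t).uniform_(-1.0, 1.0)

    # Optional inner maximization (PGD) to make the perturbation adversarial.
    for _ in range(adv_steps):
        delta.requires_grad_(True)
        target = eps + delta / sigma                      # equivariant target
        loss_inner = F.mse_loss(model(x_t + delta, t), target)
        (grad,) = torch.autograd.grad(loss_inner, delta)
        with torch.no_grad():
            delta = (delta + adv_lr * grad.sign()).clamp(-eps_ball, eps_ball)

    delta = delta.detach()
    target = eps + delta / sigma
    return F.mse_loss(model(x_t + delta, t), target)

# Assumed usage with a standard epsilon-prediction UNet `model(x, t)`:
#   loss = perturbed_dsm_loss(model, batch, alphas_bar, adv_steps=3)
#   loss.backward(); optimizer.step()
```

With adv_steps=0 this reduces to ordinary score matching on randomly perturbed inputs; increasing adv_steps trades extra compute per step for an AT-style inner maximization.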