A Geometric Perspective on Recursive Synthetic Training
Abstract
Scaling up high-quality datasets is an effective way to improve generative models, but it is becoming increasingly difficult because of data scarcity and contamination. Attempting to sidestep this by naively bootstrapping generative models, that is, retraining them on synthetic data, results in significant quality degradation and a collapse in sample diversity. In this paper, we study the negative effects of synthetic data on the geometry of deep generative networks (DGNs) in order to understand how to go beyond naive synthetic data utilization. Through empirical simulations, we show that retraining on synthetic data leads to DGNs with low-quality singular vectors and input-output Jacobians of low effective rank. Using these insights, we develop a strategy that uses negative guidance to generate synthetic data from a DGN in order to improve its quality.
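To make the effective-rank diagnostic concrete, the following is a minimal sketch rather than the implementation used in this paper. It assumes the entropy-based definition of effective rank (the exponential of the Shannon entropy of the normalized singular values), which may differ from the definition adopted here. The two-layer tanh generator, its random weights `w1` and `w2`, and the rank-one `w2_collapsed` used to mimic a collapsed model are all hypothetical stand-ins chosen for illustration.

```python
import numpy as np


def effective_rank(matrix, eps=1e-12):
    """Entropy-based effective rank (an assumed definition):
    exp of the Shannon entropy of the normalized singular values."""
    s = np.linalg.svd(matrix, compute_uv=False)
    total = s.sum()
    if total <= eps:
        return 0.0
    p = s / total
    p = p[p > eps]  # drop numerically zero singular values
    return float(np.exp(-(p * np.log(p)).sum()))


def toy_generator(z, w1, w2):
    """Hypothetical two-layer generator g(z) = w2 @ tanh(w1 @ z)."""
    return w2 @ np.tanh(w1 @ z)


def jacobian(z, w1, w2):
    """Analytic input-output Jacobian of toy_generator at z:
    dg/dz = w2 @ diag(1 - tanh(w1 @ z)^2) @ w1."""
    h = np.tanh(w1 @ z)
    return w2 @ ((1.0 - h ** 2)[:, None] * w1)


rng = np.random.default_rng(0)
latent_dim, hidden_dim, out_dim = 4, 16, 8
w1 = rng.normal(size=(hidden_dim, latent_dim))
w2 = rng.normal(size=(out_dim, hidden_dim))
z = rng.normal(size=latent_dim)

# Effective rank of the Jacobian for generic random weights.
erank_full = effective_rank(jacobian(z, w1, w2))

# Mimic a collapsed generator with rank-one output weights (illustrative only).
w2_collapsed = np.outer(rng.normal(size=out_dim), rng.normal(size=hidden_dim))
erank_collapsed = effective_rank(jacobian(z, w1, w2_collapsed))

print("effective rank, generic weights:  ", erank_full)
print("effective rank, collapsed weights:", erank_collapsed)
```

In this toy setting the Jacobian is an 8 x 4 matrix, so its effective rank is at most 4, and the rank-one variant has an effective rank of approximately 1. For a real DGN the Jacobian would typically be obtained by automatic differentiation and averaged over many latent samples.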