Robustness of Probabilistic Models to Low-Quality Data: A Multi-Perspective Analysis
Abstract
A systematic, comparative investigation into the effects of low-quality data reveals a stark spectrum of robustness across modern probabilistic models. We find that autoregressive language models are remarkably resilient across token-prediction and sequence-to-sequence tasks (for GPT-2, test NLL rises only modestly from 2.87 to 3.59 despite 50\% token corruption). By contrast, under the same levels of data corruption, class-conditional diffusion models degrade catastrophically (image-label consistency plummets by 56.81\% relative to baseline), while classifiers show a moderate impact that diminishes with dataset scale. To explain these discrepancies, we analyze the results through a multi-perspective lens that integrates information theory, PAC learning, and gradient dynamics. These analyses suggest that robustness is governed largely by two principles: the \textbf{richness of conditioning information}, which constrains the learning problem, and the \textbf{absolute information content} of the training data, which allows the signal from correct examples to dominate the statistical noise introduced by corruption.