How Text Quality Interventions Reshape Neural Scaling Laws for LLMs: An Empirical Study
Abstract
Neural scaling laws are widely used for performance projection and resource planning, yet their sensitivity to data quality interventions remains poorly understood. We present the first large-scale empirical study of how interventions—deduplication, heuristic filtering, and LLM-guided rewriting—reshape scaling behavior in large language model training. Using QualityPajama, a suite of 23 systematically curated datasets, we train over 2,000 models (100M–8B parameters, 100M–200B tokens) to measure how text quality interventions affects scaling-law parameters and compute-optimal design decisions. While prior studies have shown that model architecture primarily shifts coefficients, we demonstrate that data interventions shift both coefficients and exponents, fundamentally changing the fitted scaling laws in ways not anticipated by existing theory. We show that data quality ranking is scale and resource-dependent. Compute-optimal token–to-parameter ratios vary by orders of magnitude across interventions, revealing a fundamental data quality–quantity trade-off in scaling. These findings pave the way for deeper theoretical understanding of scaling laws, establish scaling-law analysis as a principled framework for data strategy evaluation and ranking, and motivate a data-quality–aware approach to scaling next-generation LLMs.