Less is More: Adaptive Coverage Sampling for Synthetic Training Data
Abstract
Large Language Models (LLMs) enable rapid generation of synthetic training data for downstream classifiers, offering a solution when human-labeled data is costly, scarce, or time-sensitive. However, synthetic datasets suffer from systematic redundancy: LLMs over-generate common patterns while underrepresenting nuanced edge cases, leading to training inefficiency and degraded generalization. We introduce Adaptive Coverage Sampling (ACS), a principled method that formulates synthetic data selection as a graph-based maximum coverage problem over semantic similarity. By constructing a similarity graph with adaptively tuned thresholds and applying a greedy approximation, ACS identifies diverse, representative subsets without requiring iterative model training or expensive quality scoring. We demonstrate a striking ``less is more'' phenomenon across sentiment analysis, relation extraction, and named entity recognition tasks: classifiers trained on ACS-selected subsets comprising just 10-30\% of the original synthetic data match or exceed the performance of models trained on the full datasets. This data reduction translates directly into lower fine-tuning compute costs, while the increased diversity of the selected subset improves model generalization. Our results indicate that carefully curated synthetic data matches or outperforms naive use of large, redundant corpora, and that subset selection is an essential step in using synthetic data effectively.
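To make the selection step concrete, the following is a minimal sketch of greedy maximum coverage over a cosine-similarity graph, not the authors' implementation. It assumes that sentence embeddings are already available as a NumPy array and that a fixed similarity threshold is given; the adaptive threshold tuning that ACS performs is not reproduced here. The function name `greedy_coverage_selection` and its parameters are hypothetical and chosen for illustration only.

```python
import numpy as np


def greedy_coverage_selection(embeddings, k, threshold):
    """Greedily pick up to k rows whose similarity neighborhoods cover the most rows.

    Illustrative sketch only: `threshold` is a fixed assumption here,
    whereas ACS tunes its threshold adaptively.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    # Normalize rows so that dot products equal cosine similarities.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)

    # Graph edge: sample i "covers" sample j when similarity >= threshold.
    adjacency = (unit @ unit.T) >= threshold
    np.fill_diagonal(adjacency, True)  # every sample covers itself

    covered = np.zeros(len(embeddings), dtype=bool)
    selected = []
    for _ in range(k):
        # Marginal gain: how many still-uncovered samples each candidate covers.
        gains = (adjacency & ~covered).sum(axis=1)
        best = int(np.argmax(gains))
        if gains[best] == 0:
            break  # everything is already covered
        selected.append(best)
        covered |= adjacency[best]

    # Return the chosen indices and the fraction of samples they cover.
    return selected, float(covered.mean())
```

Because coverage is a monotone submodular objective, this greedy rule is the standard approximation for maximum coverage, with a guarantee of at least a 1 - 1/e fraction of the optimal coverage for a given budget `k`. In this sketch, a lower threshold makes each selected sample cover more neighbors, so fewer samples are needed to reach a given coverage level.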