Poster in Workshop: 3rd Workshop on Navigating and Addressing Data Problems For Foundation Models (DATA-FM)

Less is More: Adaptive Coverage Sampling for Synthetic Training Data

Sasan Tavakkol ⋅ Max Springer ⋅ MohammadHossein Bateni ⋅ Vincent Cohen-Addad ⋅ Neslihan Bulut ⋅ MohammadTaghi Hajiaghayi


Abstract

Large Language Models (LLMs) enable rapid generation of synthetic training data for downstream classifiers, offering a solution when human-labeled data is costly, scarce, or time-sensitive. However, synthetic datasets suffer from systematic redundancy: LLMs over-generate common patterns while underrepresenting nuanced edge cases, leading to training inefficiency and degraded generalization. We introduce Adaptive Coverage Sampling (ACS), a principled method that formulates synthetic data selection as a graph-based maximum coverage problem over semantic similarity. By constructing a similarity graph with adaptively tuned thresholds and applying greedy approximation, ACS identifies maximally diverse, representative subsets without requiring iterative model training or expensive quality scoring. We demonstrate a striking "less is more" phenomenon across sentiment analysis, relation extraction, and named entity recognition tasks: classifiers trained on ACS-selected subsets comprising just 10-30% of the original synthetic data match or exceed the performance of models trained on the full datasets. This data reduction translates directly into lower fine-tuning compute costs while improving model generalization through enhanced diversity. Our results establish that carefully curated synthetic data systematically outperforms naive use of large, redundant corpora, and that intelligent subset selection is essential for using synthetic data effectively.
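The selection pipeline described in the abstract (similarity graph, threshold tuning, greedy maximum coverage) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how embeddings are computed or how the threshold is tuned, so the sketch assumes cosine similarity over precomputed embedding vectors and a simple binary search that looks for the strictest threshold at which `k` greedily chosen samples still cover a target fraction of the data. The names `build_graph`, `greedy_max_coverage`, `select_subset`, and the `target_coverage` parameter are hypothetical, and the toy 2-D vectors stand in for real text embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def build_graph(embeddings, threshold):
    """Connect samples whose similarity is at least `threshold`.

    Every sample is its own neighbor, so selecting it always covers it.
    """
    n = len(embeddings)
    neighbors = [{i} for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                neighbors[i].add(j)
                neighbors[j].add(i)
    return neighbors


def greedy_max_coverage(neighbors, k):
    """Standard greedy heuristic: repeatedly pick the sample that covers
    the most not-yet-covered samples."""
    covered, selected = set(), []
    for _ in range(min(k, len(neighbors))):
        best = max(
            (i for i in range(len(neighbors)) if i not in selected),
            key=lambda i: len(neighbors[i] - covered),
        )
        selected.append(best)
        covered |= neighbors[best]
    return selected, covered


def select_subset(embeddings, k, target_coverage=0.9, steps=20):
    """Hypothetical threshold tuning: binary-search for a strict threshold
    at which k greedy picks still cover `target_coverage` of the data."""
    lo, hi = -1.0, 1.0
    best, _ = greedy_max_coverage(build_graph(embeddings, lo), k)
    for _ in range(steps):
        mid = (lo + hi) / 2
        selected, covered = greedy_max_coverage(build_graph(embeddings, mid), k)
        if len(covered) >= target_coverage * len(embeddings):
            best, lo = selected, mid  # target met: try a stricter threshold
        else:
            hi = mid  # target missed: relax the threshold
    return best


# Toy data: three near-duplicates, two near-duplicates, and one outlier.
embeddings = [[1, 0], [0.99, 0.1], [0.98, 0.15], [0, 1], [0.1, 0.99], [-1, 0]]
subset = select_subset(embeddings, k=3, target_coverage=1.0)
print(subset)
```

The graph construction here is quadratic in the number of samples, which is fine for a toy example; a real corpus would likely need approximate nearest-neighbor search instead. The greedy step itself is the classic heuristic for maximum coverage, which carries a (1 - 1/e) approximation guarantee.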
