Positive Mining from LLM Seeds: A Semi-Supervised Graph-Based Approach to Train Rare Event Classifiers
Abstract
Detecting rare events, from emerging hate speech to novel fraud patterns, presents a fundamental cold-start challenge: without labeled examples we cannot train classifiers, and manually searching vast unlabeled corpora for rare instances is prohibitively expensive. This paper introduces SYNAPSE-G (Synthetic Augmentation for Positive Sampling via Expansion on Graphs), a framework that bridges Large Language Models (LLMs) and graph-based learning to bootstrap rare event detection from scratch. Rather than using synthetic data for direct model training, SYNAPSE-G employs LLM-generated examples as intelligent ``seeds'' to efficiently probe large unlabeled datasets. These seeds initialize a semi-supervised label propagation process over a similarity graph, which identifies real candidate instances for oracle verification. We provide a theoretical analysis connecting the quality of the synthetic seeds, specifically their validity (accuracy) and diversity (coverage), to the precision and recall of the discovered positives, revealing a nuanced trade-off between these properties. Through systematic evaluation on imbalanced SST2 and Measuring Hate Speech datasets, we demonstrate that SYNAPSE-G discovers 28.6\% of rare positives while querying only 2.4\% of the data, substantially outperforming standard active learning baselines. Our work establishes design principles for combining synthetic data generation with graph-based discovery under extreme class imbalance.
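To make the seed-then-propagate idea concrete, the following is a minimal sketch, not the SYNAPSE-G implementation. The abstract does not specify the graph construction or the propagation rule, so everything here is an assumption chosen for illustration: a cosine k-nearest-neighbor graph, a label-spreading-style update with damping factor `alpha`, random 2-D vectors standing in for text embeddings, and a ground-truth label array standing in for the oracle. All function and parameter names are hypothetical.

```python
import numpy as np


def knn_transition_matrix(X, k):
    """Row-normalized, symmetrized cosine kNN graph (an assumed construction)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T
    np.fill_diagonal(S, -np.inf)              # exclude self-loops from the kNN search
    n = S.shape[0]
    nbrs = np.argsort(-S, axis=1)[:, :k]      # k most similar nodes per row
    rows = np.repeat(np.arange(n), k)
    cols = nbrs.ravel()
    W = np.zeros((n, n))
    W[rows, cols] = np.clip(S[rows, cols], 0.0, None)  # keep non-negative weights
    W = np.maximum(W, W.T)                    # symmetrize
    deg = W.sum(axis=1, keepdims=True)
    return W / np.maximum(deg, 1e-12)


def propagate_from_seeds(P, seed_idx, alpha=0.85, iters=50):
    """Iterate f <- alpha * P f + (1 - alpha) * y0, with y0 = 1 on seed nodes."""
    y0 = np.zeros(P.shape[0])
    y0[seed_idx] = 1.0
    f = y0.copy()
    for _ in range(iters):
        f = alpha * (P @ f) + (1.0 - alpha) * y0
    return f


def select_candidates(scores, seed_idx, budget):
    """Return the `budget` highest-scoring non-seed nodes to send to the oracle."""
    masked = scores.copy()
    masked[seed_idx] = -np.inf                # synthetic seeds are never queried
    return np.argsort(-masked)[:budget]


def run_toy_example():
    rng = np.random.default_rng(0)
    # Stand-in embeddings: 200 negatives, 10 rare positives, 2 synthetic seeds.
    negatives = np.array([1.0, 0.0]) + 0.05 * rng.standard_normal((200, 2))
    positives = np.array([0.0, 1.0]) + 0.05 * rng.standard_normal((10, 2))
    seeds = np.array([0.0, 1.0]) + 0.05 * rng.standard_normal((2, 2))
    X = np.vstack([negatives, positives, seeds])
    labels = np.concatenate([np.zeros(200, dtype=int), np.ones(10, dtype=int)])
    seed_idx = np.array([210, 211])

    P = knn_transition_matrix(X, k=8)
    scores = propagate_from_seeds(P, seed_idx)
    candidates = select_candidates(scores, seed_idx, budget=10)
    return candidates, labels


if __name__ == "__main__":
    candidates, labels = run_toy_example()
    found = int(labels[candidates].sum())     # simulated oracle verification
    print(f"queried {len(candidates)} of {len(labels)} real points, "
          f"found {found} of {int(labels.sum())} positives")
```

In this sketch the synthetic seeds are added as extra graph nodes and are masked out before ranking, so only real instances are ever sent for verification. The toy clusters are deliberately well separated; the numbers it prints say nothing about the results reported above.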