Towards a Foundation Model for Crowdsourced Label Aggregation
Abstract
Inferring ground truth from noisy, crowdsourced labels is a fundamental challenge in machine learning. For decades, the dominant paradigm has relied on dataset-specific parameter estimation, an approach that does not scale and cannot transfer knowledge across datasets. Recent efforts toward universal aggregation models overlook the structural and behavioral complexities of real-world human annotation, resulting in poor performance in practice. To address this gap, we introduce CrowdFM, a foundation model for crowdsourced label aggregation. At its core, CrowdFM is a bipartite graph neural network pre-trained on a vast, domain-randomized synthetic dataset. By combining a size-invariant initialization with attention-based message passing, it learns universal principles of collective intelligence and generalizes to new, unseen datasets. Extensive experiments on 22 real-world benchmarks show that a single, fixed CrowdFM model consistently matches or surpasses bespoke, per-dataset methods in both accuracy and efficiency. Moreover, the representations learned by CrowdFM readily support diverse downstream applications, such as worker assessment and task assignment. Code and pre-trained models will be made publicly available upon acceptance.
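The abstract does not specify CrowdFM's architecture in detail, but the core idea of message passing on a worker-task bipartite graph can be illustrated with a minimal sketch. The snippet below is a hypothetical, NumPy-only toy (not the authors' model): task beliefs are attention-weighted votes over worker labels, and each worker's attention weight is updated from its agreement with the current beliefs. All names (`labels`, `aggregate`, the softmax temperature) are illustrative assumptions.

```python
import numpy as np

# Toy label matrix: rows = workers, columns = tasks, entries in {0, 1};
# -1 marks a task the worker did not annotate. (Hypothetical data.)
labels = np.array([
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 1],   # a noisier worker
    [1, 0, 1, -1],  # missing annotation for task 3
])

def aggregate(labels, n_classes=2, n_rounds=5):
    """Attention-flavored message passing on the worker-task bipartite
    graph: task beliefs are attention-weighted votes, and worker attention
    is a softmax of agreement with the current task beliefs."""
    n_workers, n_tasks = labels.shape
    weights = np.ones(n_workers)      # start with uniform worker attention
    mask = labels >= 0                # which (worker, task) labels exist
    for _ in range(n_rounds):
        # Task-side update: weighted vote over observed labels per class.
        beliefs = np.zeros((n_tasks, n_classes))
        for c in range(n_classes):
            beliefs[:, c] = ((labels == c) * weights[:, None]).sum(axis=0)
        beliefs /= beliefs.sum(axis=1, keepdims=True)
        # Worker-side update: attention = softmax of agreement rate.
        est = beliefs.argmax(axis=1)
        agree = np.array([
            (labels[w][mask[w]] == est[mask[w]]).mean()
            for w in range(n_workers)
        ])
        weights = np.exp(4.0 * agree)  # temperature chosen arbitrarily
        weights /= weights.sum()
    return beliefs.argmax(axis=1)

print(aggregate(labels))  # -> [1 0 0 1]: the noisy worker is down-weighted
```

In this toy, the disagreement on task 2 (a 2-2 tie under uniform voting) is resolved in favor of the workers whose labels agree most with the consensus, which is the intuition behind learning worker reliability jointly with task labels; a real foundation model would replace these hand-written updates with learned, pre-trained ones.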