Poster in Workshop: Navigating and Addressing Data Problems for Foundation Models (DPFM)
Enhancing Data Quality in Federated Fine-Tuning of Foundation Models
Wanru Zhao · Yaxin Du · Nic Lane · Siheng Chen · Yanfeng Wang
Keywords: [ data quality ] [ privacy ] [ large language model ]
Foundation model training currently relies heavily on public domain data, which recent research suggests is nearing exhaustion. Scaling further will require collaboration among multiple specialized, high-quality private domain data sources. However, training models locally without sharing private data makes data quality control difficult. To address this, we propose a data quality control pipeline for federated foundation model training. The pipeline computes scores that reflect the quality of training data and determines a global threshold that serves as a unified standard, aiming to improve global performance. Our experiments show that the proposed pipeline improves the effectiveness and reliability of model training, leading to better performance.
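The score-then-threshold idea in the abstract can be illustrated with a small sketch. The abstract does not specify the scoring function or how the global threshold is chosen, so everything below is an assumption made for illustration: `score_sample` is a hypothetical placeholder (fraction of distinct words), and `global_threshold` picks a cutoff from the pooled scores so that roughly `keep_fraction` of all samples survive. Only scores leave each client; raw text stays local.

```python
from typing import Dict, List


def score_sample(text: str) -> float:
    """Hypothetical quality score: fraction of distinct words in the sample.

    This is a stand-in for illustration only, not the scoring method
    used in the paper.
    """
    words = text.split()
    return len(set(words)) / len(words) if words else 0.0


def local_scores(samples: List[str]) -> List[float]:
    """Runs on each client: score every local sample without sharing the data."""
    return [score_sample(s) for s in samples]


def global_threshold(client_scores: Dict[str, List[float]],
                     keep_fraction: float = 0.5) -> float:
    """Runs on the server: derive one threshold from the pooled scores.

    Assumption: the cutoff is the score above which roughly
    `keep_fraction` of all pooled samples fall.
    """
    pooled = sorted(s for scores in client_scores.values() for s in scores)
    if not pooled:
        raise ValueError("no scores received from clients")
    cut = min(int(len(pooled) * (1.0 - keep_fraction)), len(pooled) - 1)
    return pooled[cut]


def filter_local(samples: List[str], scores: List[float],
                 threshold: float) -> List[str]:
    """Runs on each client: keep only samples that meet the shared threshold."""
    return [s for s, q in zip(samples, scores) if q >= threshold]
```

In this sketch each client would call `local_scores`, send only the resulting numbers to the server, receive the value returned by `global_threshold`, and then apply `filter_local` before local fine-tuning. Applying one threshold to all clients is what gives them a unified standard even though their data distributions differ.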