Poster in Workshop: Navigating and Addressing Data Problems for Foundation Models (DPFM)
Data Debiasing via Model-free Data Pruning
Lei Hsiung · Yaoqing Yang
Keywords: [ subpopulation shift ] [ Data pruning ] [ Data Debiasing ]
Addressing dataset bias is crucial for developing fair and reliable machine-learning models. However, previous debiasing methods typically tackle this problem by re-weighting data samples during model training, which requires knowledge of attribute information. We argue that having complete attribute information for datasets is unrealistic, so previous methods can inadvertently exacerbate bias in unspecified groups. In this paper, we propose CG Pruning, a novel approach that uses the Complexity Gap (CG) score for data valuation to mitigate dataset biases. By pruning data samples with low CG scores, our method reduces spurious correlations, attribute imbalance, and class imbalance in the dataset. Experimental results on the Waterbirds dataset show that CG Pruning improves model performance across various learning algorithms, achieving higher test accuracy and worst-group accuracy.
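The pruning step and the worst-group metric mentioned in the abstract can be illustrated with a minimal sketch. It assumes per-sample scores have already been computed; the abstract does not define how the CG score is calculated, so none of that is reproduced here. The function names, the `prune_fraction` parameter, and the toy values are hypothetical and are not taken from the paper.

```python
import numpy as np


def prune_lowest_scores(scores, prune_fraction):
    """Return sorted indices of the samples kept after dropping the
    lowest-scoring fraction. `scores` stands in for precomputed
    per-sample values (assumption: higher score = keep)."""
    scores = np.asarray(scores, dtype=float)
    if not 0.0 <= prune_fraction < 1.0:
        raise ValueError("prune_fraction must be in [0, 1)")
    n_prune = int(np.floor(prune_fraction * scores.size))
    # Ascending order: the first n_prune entries are the lowest scores.
    order = np.argsort(scores, kind="stable")
    return np.sort(order[n_prune:])


def worst_group_accuracy(y_true, y_pred, groups):
    """Return the minimum per-group accuracy over all groups present."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    correct = y_true == y_pred
    return float(min(correct[groups == g].mean() for g in np.unique(groups)))


# Toy data, purely illustrative.
kept = prune_lowest_scores([0.9, 0.1, 0.5, 0.3], prune_fraction=0.5)
wga = worst_group_accuracy(
    y_true=[1, 1, 0, 0], y_pred=[1, 0, 0, 0], groups=[0, 0, 1, 1]
)
```

Because the pruning function only needs the scores, it is independent of any model or training algorithm; the kept indices would then be used to subset the training set before training as usual.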