

Poster

Efficient Top-m Data Values Identification for Data Selection

Xiaoqiang Lin · Xinyi Xu · See-Kiong Ng · Bryan Kian Hsiang Low

Hall 3 + Hall 2B #618
Wed 23 Apr 7 p.m. PDT — 9:30 p.m. PDT

Abstract: Data valuation has found many real-world applications, e.g., data pricing and data selection. However, the most widely adopted approach, the Shapley value (SV), is computationally expensive due to the large number of model trainings required. Fortunately, most applications (e.g., data selection) require only the m data points with the highest data values (i.e., the top-m data values), which suggests that fewer model trainings may suffice because exact data values are not needed. Existing work formulates top-m Shapley value identification as top-m arms identification in multi-armed bandits (MAB). However, that approach falls short because it does not use data features to predict data values, a technique that has been shown empirically to be effective. A recent top-m arms identification work does consider arm features, but it assumes a linear relationship between arm features and rewards, which often does not hold in data valuation. To this end, we propose the GPGapE algorithm, which uses a Gaussian process to model the non-linear mapping from data features to data values, removing the linearity assumption. We theoretically analyze the correctness and stopping iteration of GPGapE in finding an (ϵ,δ)-approximation to the top-m data values. We further improve computational efficiency by calculating data values using small data subsets, which reduces the cost of model training. We empirically demonstrate that GPGapE outperforms baselines in top-m data values identification, noisy data detection, and data subset selection on real-world datasets. We also demonstrate the efficiency of GPGapE in data selection for large language model fine-tuning.
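
For readers unfamiliar with the setting, the sketch below illustrates what "top-m data values" means and why exact Shapley values are costly. It is not GPGapE and involves no bandit or Gaussian process component; it is a brute-force baseline that averages marginal contributions over every ordering of the data points. The names `exact_shapley`, `top_m`, `toy_utility`, and `QUALITY` are hypothetical, and the toy utility is an assumed stand-in for "train a model on this subset and measure its performance".

```python
from itertools import permutations
from math import factorial


def exact_shapley(n_points, utility):
    """Exact Shapley values: average marginal contribution over all orderings.

    Calls `utility` n_points * n_points! times, which is why this only
    works for a handful of points.
    """
    values = [0.0] * n_points
    for order in permutations(range(n_points)):
        coalition = frozenset()
        prev = utility(coalition)
        for i in order:
            coalition = coalition | {i}
            cur = utility(coalition)
            values[i] += cur - prev  # marginal contribution of point i
            prev = cur
    n_orders = factorial(n_points)
    return [v / n_orders for v in values]


def top_m(values, m):
    """Indices of the m largest values, highest first."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:m]


# Hypothetical per-point "quality" scores (assumption, not from the paper).
QUALITY = [0.9, 0.1, 0.5, 0.7]


def toy_utility(coalition):
    """Assumed utility with diminishing returns; stands in for model training."""
    total = sum(QUALITY[i] for i in coalition)
    return total / (1.0 + total)


values = exact_shapley(len(QUALITY), toy_utility)
best = top_m(values, 2)
print(values, best)
```

With n data points, this enumeration makes n · n! utility calls, each of which would be a model training in practice. Since only the identity of the top-m points matters for selection, a method that resolves the ranking near the m-th position can stop without estimating every value precisely.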
