When do Score-Based Data Valuation Methods Work, and Why?
Kumar Kshitij Patel ⋅ Sai Karimireddy ⋅ Raul Fernandez ⋅ Manolis Zampetakis
Abstract
Score-based valuation methods, such as Shapley-style scores and Leave-one-out (LOO), are widely used for credit assignment in data markets, yet theory offers limited guidance on when and why these methods succeed. In this paper, we study these methods using the best subset selection problem. We show that, even with monotone-submodular valuation functions, selection using LOO and Shapley-style scores cannot achieve a constant-factor approximation due to duplicate archetypes and collapsed pointwise credit. We also find that boundary effects in canonical learning problems can lead to supermodular spikes, preventing any valuation method$-$including adaptive methods like greedy selection$-$from achieving a constant-factor approximation. We identify two conditions that avert these failure modes: (i) bounded curvature, which controls redundancy and restores guarantees for LOO and Shapley-style scores, and (ii) coverage, which yields approximate submodularity on top of a sufficiently rich core. Our theoretical results and experiments motivate a practical algorithmic pipeline: deduplicate, ensure coverage, then apply score-based selection at an appropriate granularity.
Chat is not available.
Successful Page Load