Poster in Workshop: The Future of Machine Learning Data Practices and Repositories
Rethinking Dataset Pruning From A Generalization Perspective
Furui Xu · Shaobo Wang · Luo Zhongwei · Linfeng Zhang
The growing scale of datasets in deep learning has introduced significant computational challenges. To address this problem, dataset pruning aims to construct an informative coreset from the full dataset that retains comparable performance. Previous dataset pruning methods mostly score samples by how they behave during the training (i.e., fitting) phase. In this paper, we rethink dataset pruning from the perspective of generalization, i.e., we score samples using models that have not been trained on them. We propose UNSEEN, a plug-and-play framework that can be integrated into existing dataset pruning methods. For instance, under our framework even the simplest method, Entropy, achieves accuracy comparable to state-of-the-art (SOTA) methods. We validate our method on several datasets, including CIFAR-10, CIFAR-100, and ImageNet-1K, to demonstrate its effectiveness.
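As an illustration of scoring samples with models that have not seen them, the sketch below splits the data into folds, fits a model on all folds but one, and scores the held-out fold by predictive entropy. This is one possible reading of the idea and not the authors' UNSEEN procedure, which the abstract does not specify. The fold-based split, the nearest-centroid classifier standing in for a neural network, the choice to keep the highest-entropy samples, and all function names are assumptions made for illustration.

```python
import numpy as np

def entropy(probs, eps=1e-12):
    """Shannon entropy of each row of an (n_samples, n_classes) probability matrix."""
    probs = np.clip(probs, eps, 1.0)
    return -(probs * np.log(probs)).sum(axis=1)

def fit_centroids(X, y, n_classes):
    """Toy stand-in for model training (assumption): one mean vector per class."""
    return np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])

def predict_proba(centroids, X):
    """Softmax over negative squared distances to the class centroids."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def held_out_entropy_scores(X, y, n_classes, n_folds=4, seed=0):
    """Score every sample with a model fitted only on the other folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    scores = np.full(len(X), np.nan)
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(X)), held_out)
        centroids = fit_centroids(X[train], y[train], n_classes)
        scores[held_out] = entropy(predict_proba(centroids, X[held_out]))
    return scores

def prune(scores, keep_ratio):
    """Return sorted indices of the highest-scoring fraction (assumed selection rule)."""
    n_keep = max(1, int(round(keep_ratio * len(scores))))
    return np.sort(np.argsort(-scores)[:n_keep])

# Synthetic three-class data, used only to exercise the functions above.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(c, 1.0, size=(60, 2)) for c in (-3.0, 0.0, 3.0)])
y = np.repeat(np.arange(3), 60)

scores = held_out_entropy_scores(X, y, n_classes=3)
kept = prune(scores, keep_ratio=0.5)
print(f"kept {len(kept)} of {len(X)} samples")
```

Because each sample is scored by a model that never fitted it, the score reflects how uncertain the model is on unseen data rather than how well it memorized the sample. The same wrapper could in principle host any other per-sample score in place of entropy.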