Poster in Workshop: The Future of Machine Learning Data Practices and Repositories

Rethinking Dataset Pruning From A Generalization Perspective

Furui Xu · Shaobo Wang · Luo Zhongwei · Linfeng Zhang


Abstract

The growing scale of datasets in deep learning has introduced significant computational challenges. To address this problem, dataset pruning aims to construct an informative coreset from the full dataset such that training on the coreset achieves performance comparable to training on the full dataset. Previous dataset pruning methods mostly score samples by how the model behaves on them during the training (i.e., fitting) phase. In this paper, we rethink dataset pruning from the perspective of generalization, i.e., we score samples using models that have not been trained on them. We propose UNSEEN, a plug-and-play framework that can be integrated into existing dataset pruning methods. For instance, under our framework even the simplest Entropy method achieves accuracy comparable to state-of-the-art (SOTA) methods. We validate our method on various datasets, including CIFAR-10, CIFAR-100, and ImageNet-1K, to demonstrate its effectiveness.
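To make the idea of scoring samples with models that have not seen them more concrete, the following is a minimal sketch and not the UNSEEN implementation, whose details the abstract does not give. It assumes one plausible reading: split the data into folds, fit a model on all folds but one, and score the held-out fold with prediction entropy. The nearest-centroid classifier, the fold count, the function names, and the choice to keep the highest-entropy samples are all illustrative assumptions.

```python
import numpy as np

def fit_centroids(X, y, num_classes):
    # Hypothetical stand-in model: one mean vector per class.
    return np.stack([X[y == c].mean(axis=0) for c in range(num_classes)])

def predict_proba(centroids, X):
    # Softmax over negative squared distances to each class centroid.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def entropy(p, eps=1e-12):
    # Shannon entropy of each row of class probabilities.
    return -(p * np.log(p + eps)).sum(axis=1)

def held_out_entropy_scores(X, y, num_classes, num_folds=4, seed=0):
    # Assumption: every sample is scored by a model fitted on the other folds,
    # so no sample is scored by a model that was trained on it.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), num_folds)
    scores = np.empty(len(X))
    for k, held_out in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != k])
        centroids = fit_centroids(X[train], y[train], num_classes)
        scores[held_out] = entropy(predict_proba(centroids, X[held_out]))
    return scores

def prune(scores, keep_fraction):
    # Assumption: keep the highest-entropy (most uncertain) samples.
    n_keep = max(1, int(round(keep_fraction * len(scores))))
    return np.sort(np.argsort(-scores)[:n_keep])

# Toy two-class data standing in for a real dataset.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-1.0, 1.0, (100, 2)),
                    rng.normal(1.0, 1.0, (100, 2))])
y = np.repeat([0, 1], 100)

scores = held_out_entropy_scores(X, y, num_classes=2)
coreset = prune(scores, keep_fraction=0.3)
```

In this sketch the only difference from a conventional Entropy-based pruning pipeline is where the scoring model comes from: each score is computed by a model fitted without the scored sample. Any other scoring function could in principle be substituted for `entropy` in the same loop.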
