Poster
Exploring Learning Complexity for Efficient Downstream Dataset Pruning
Wenyu Jiang · Zhenlong Liu · Zejian Xie · Songxin Zhang · Bingyi Jing · Hongxin Wei
Hall 3 + Hall 2B #575
Thu 24 Apr 7 p.m. PDT — 9:30 p.m. PDT
Abstract:
The ever-increasing fine-tuning cost of large-scale pre-trained models gives rise to the importance of dataset pruning, which aims to reduce dataset size while maintaining task performance. However, existing dataset pruning methods require training on the entire dataset, which is impractical for large-scale pre-trained models. In this paper, we propose a straightforward, novel, and training-free hardness score, named Distorting-based Learning Complexity (DLC), to efficiently identify informative images and instructions in the downstream dataset. Our method is motivated by the observation that easy samples learned faster can also be learned with fewer parameters. Specifically, we define Learning Complexity to quantify sample hardness and use a lightweight weight-masking process for fast estimation, instead of costly SGD optimization. Based on DLC, we further design a flexible under-sampling strategy with randomness (dubbed FlexRand), replacing the top-K strategy, to alleviate the severe subset distribution shift. Extensive experiments on downstream image and instruction dataset pruning benchmarks demonstrate the effectiveness and efficiency of the proposed approach. On the image pruning benchmark, DLC significantly reduces the pruning time by 35× while establishing state-of-the-art performance with FlexRand.
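The abstract outlines two ideas: a training-free hardness score estimated by masking pre-trained weights instead of running SGD, and a randomized under-sampling rule (FlexRand) that replaces strict top-K selection. The Python sketch below is only an illustration of how such a pipeline could look, not the authors' released implementation; the function names, the random weight-masking scheme, the loss-averaging scoring rule, and the mask_ratios, pool_ratio, and keep_ratio parameters are all assumptions for this example.

# Minimal sketch (assumed, not the paper's code): a sample whose loss stays low
# even when a fraction of the pre-trained weights is zeroed out is treated as
# "easy"; one whose loss degrades quickly under masking is treated as "hard".
# FlexRand-style selection then draws the kept subset at random from a larger
# pool of hard samples rather than taking a strict top-K.

import copy
import torch
import torch.nn.functional as F

def dlc_scores(model, loader, mask_ratios=(0.1, 0.3, 0.5, 0.7), device="cpu"):
    """Assumed hardness proxy: mean per-sample loss over randomly weight-masked copies."""
    masked_models = []
    for r in mask_ratios:
        m = copy.deepcopy(model).to(device).eval()
        with torch.no_grad():
            for p in m.parameters():
                # Zero out roughly a fraction r of the weights (illustrative masking).
                p.mul_((torch.rand_like(p) > r).float())
        masked_models.append(m)

    scores = []
    with torch.no_grad():
        for x, y in loader:  # assumes a classification loader of (images, labels)
            x, y = x.to(device), y.to(device)
            per_sample = torch.stack(
                [F.cross_entropy(m(x), y, reduction="none") for m in masked_models]
            ).mean(dim=0)  # higher loss under masking -> harder sample
            scores.append(per_sample.cpu())
    return torch.cat(scores)

def flexrand_select(scores, keep_ratio=0.3, pool_ratio=0.6, generator=None):
    """Assumed FlexRand-style rule: sample the kept subset at random from the
    hardest pool_ratio fraction, instead of keeping a strict top-K."""
    n = scores.numel()
    pool = torch.argsort(scores, descending=True)[: max(1, int(pool_ratio * n))]
    perm = torch.randperm(pool.numel(), generator=generator)
    return pool[perm[: max(1, int(keep_ratio * n))]]

Under these assumptions, a downstream set would be scored once with dlc_scores and then pruned by training only on the indices returned by flexrand_select.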