[Short] ACTIVE LEARNING FOR SCALABLE DATA SELECTION IN INSTRUCTION TUNING
Abstract
Selecting high-quality training data can substantially reduce the computational cost of instruction-tuning language models, as carefully curated datasets often yield models that outperform those trained on much larger, noisier corpora. Most existing automated data selection methods for instruction tuning, however, operate in a single step and remain static throughout training. Inspired by ideas from active learning, we study iterative data selection for instruction tuning, where the training subset is updated over multiple iterations. To mitigate the computational overhead typically associated with large language models, we further show that a significantly smaller model can be used to guide data selection at negligible cost while remaining competitive on downstream tasks. Through a case study on LLaMA 3 8B (Grattafiori et al., 2024), we demonstrate that our adaptive selection algorithm consistently matches or outperforms random selection across a diverse suite of downstream benchmarks, while using fewer training examples.