TRIM: TOKEN-BUDGETED DATA MINING FOR INSTRUCTION TUNING
Abstract
Data filtering is often assumed to improve instruction tuning, but practitioners rarely control for token budget—the most binding constraint in small-scale finetuning. We introduce TRIM (Token-budget-Regulated Instruction data Mining), a simple pipeline that enforces an explicit token cap during data selection, and use it to study: at a fixed training-token budget, do common filtering heuristics beat random selection? Using a 800,000-token cap over a mixed instruction dataset (Alpaca + Dolly), we compare (i) random selection, (ii) near-duplicate removal via MinHash, and (iii) MinHash + embedding-based diversity selection. We finetune TinyLlama/TinyLlama-1.1B-Chat-v1.0 with QLoRA/LoRA-style adapters and evaluate held-out perplexity plus a lightweight toxicity probe. In this regime, both deduplication variants match but do not improve on random selection: held-out PPL differs by < 0.2% and random is slightly best. Our takeaway is deliberately modest: under tight token budgets and standard instruction corpora, “obvious” deduplication/diversity steps may be lower-impact than expected unless paired with stronger quality signals, larger budgets, or multi-seed evaluation