Train Smarter, Not Longer: Memorization-Guided Data Reuse for Efficient LLM Training
Jingwei Zuo ⋅ Ilyas Chahed ⋅ Maksim Velikanov ⋅ Cong Zeng ⋅ Dhia Rhaiem ⋅ Pasquale Balsebre ⋅ Abhay Kumar ⋅ Younes Belkada ⋅ Hakim Hacid
Abstract
The training paradigm of large language models has shifted from traditional one-pass training to multi-epoch training, as reasonable reuse of limited high-quality data can improve both model performance and sample efficiency. Excessive repetition, however, introduces the risk of overfitting and diminishing returns. Determining when and how to reuse data effectively is thus a natural but under-explored question. Building on a novel observation of a model's $\textit{Memorization Window}$ signals derived from loss retention dynamics and downstream evaluation scores, we propose $\textit{Memorization-guided Data Reuse}$, a training paradigm that adaptively determines $\textit{when}$ and $\textit{how}$ data should be reused, enabling principled decisions on the number of training epochs and the scheduling of data replays. Our preliminary experiments reveal a consistent memorization-driven regime: performance continues to improve with repetition far beyond current practice (e.g., the commonly cited four-epoch limit). While a full scheduler remains future work, these insights provide a foundation for memorization-aware training schedules, helping to determine reuse budgets and move toward training LLMs $\textit{smarter rather than longer}$ with limited high-quality data.