Poster
The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws
Tian Jin · Ahmed Imtiaz Humayun · Utku Evci · Suvinay Subramanian · Amir Yazdanbakhsh · Dan Alistarh · Gintare Karolina Dziugaite
Hall 3 + Hall 2B #342
Parameter pruning has emerged as a promising technique to address the growing computational demands of large language models (LLMs). While many studies focus on post-training pruning of LLMs, sparse pre-training offers a compelling alternative: sparsifying during pre-training reduces both training and inference costs. In this work, we conduct the first comprehensive study on optimal sparse pre-training configurations for LLMs, exploring various pruning schedules across different sparsity levels and training durations. We evaluate 80 unique configurations and find that a pruning schedule starting at 25% of total training compute and ending at 75% achieves near-optimal final evaluation loss. Our findings provide valuable insights for efficient and effective sparse pre-training of LLMs. Furthermore, we propose a new scaling law that modifies the Chinchilla scaling law to use the average number of active parameters during training. We present both empirical and theoretical evidence that this modification accurately models evaluation loss for both sparsely and densely pre-trained LLMs, thus offering a unified scaling law for dense and sparse model training. Our insights suggest that, while sparse pre-training yields model loss similar to dense pre-training under the same compute budget, it offers a clear advantage: the final model is smaller, resulting in significant potential computational savings during inference.
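To make the "average parameter count" idea concrete, the sketch below (our illustration, not the authors' code) computes the mean number of active parameters under an assumed linear pruning ramp spanning 25% to 75% of the training run, then plugs that average into a Chinchilla-style loss of the form L = E + A/N^α + B/D^β with N taken as the average count. The ramp shape, step granularity, and coefficient values (the published Chinchilla fits) are assumptions for illustration, not the paper's fitted parameters.

```python
import numpy as np

def average_param_count(n_dense, sparsity, total_steps,
                        prune_start_frac=0.25, prune_end_frac=0.75):
    """Average number of active parameters over pre-training for a schedule
    that prunes between prune_start_frac and prune_end_frac of the run.

    Assumption: sparsity increases linearly over the pruning window; the
    paper's exact schedule may differ.
    """
    frac = np.arange(total_steps) / total_steps
    # Fraction of the target sparsity reached at each step:
    # 0 before the ramp, 1 after it, linear in between.
    ramp = np.clip((frac - prune_start_frac) /
                   (prune_end_frac - prune_start_frac), 0.0, 1.0)
    active = n_dense * (1.0 - sparsity * ramp)
    return active.mean()

def chinchilla_loss(n_avg, d_tokens, E=1.69, A=406.4, B=410.7,
                    alpha=0.34, beta=0.28):
    """Chinchilla-form loss L = E + A / N^alpha + B / D^beta, with model size
    N replaced by the average active parameter count over training.
    Coefficients are the published Chinchilla fits, used only as placeholders.
    """
    return E + A / n_avg**alpha + B / d_tokens**beta

# Hypothetical example: a 1B-parameter model pruned to 50% sparsity,
# trained on 20B tokens.
n_avg = average_param_count(n_dense=1e9, sparsity=0.5, total_steps=10_000)
print(f"average active parameters: {n_avg:.3e}")
print(f"predicted evaluation loss: {chinchilla_loss(n_avg, 20e9):.3f}")
```

In this sketch, a dense run would simply have n_avg equal to the full parameter count, which is how the modified law reduces to the standard Chinchilla form and unifies the dense and sparse cases.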