Deriving Hyperparameter Scaling Laws via Modern Optimization Theory
Abstract
Hyperparameter transfer has become an important component of modern large-scale training recipes. Existing methods such as muP focus primarily on transfer between model sizes, while transfer across batch sizes and training horizons commonly relies on empirical scaling rules. We study hyperparameter scaling laws for modern first-order optimizers through the lens of recent convergence bounds for methods based on the Linear Minimization Oracle (LMO), a framework that includes normalized SGD, signSGD (approximating Adam), and Muon. Treating the bound as a proxy and minimizing it under different tuning regimes yields closed-form power-law schedules for learning rate, momentum, and batch size as functions of the iteration budget K or token budget T. Our preliminary analysis suggests a non-trivial token-optimal batch size and provides budget transfer rules for extrapolating hyperparameters tuned at a small budget to longer runs.
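
For orientation, the LMO view of the three optimizers named above can be sketched as follows. This uses the standard definition of a linear minimization oracle over a unit norm ball. The symbols x_k (iterate), d_k (search direction, for example a gradient or momentum estimate), and \eta (learning rate) are notation assumed here for illustration and are not taken from the abstract.

```latex
% Generic LMO step over the unit ball of a chosen norm (assumed notation):
\[
  x_{k+1} = x_k + \eta \,\operatorname{lmo}(d_k),
  \qquad
  \operatorname{lmo}(d) \in \arg\min_{\|s\| \le 1} \langle d, s \rangle .
\]
% The choice of norm determines the method:
\[
  \operatorname{lmo}(d) =
  \begin{cases}
    -\, d / \|d\|_2
      & \text{Euclidean norm (normalized SGD)}, \\[4pt]
    -\, \operatorname{sign}(d)
      & \ell_\infty \text{ norm (signSGD)}, \\[4pt]
    -\, U V^{\top}, \quad d = U \Sigma V^{\top} \text{ (reduced SVD)}
      & \text{spectral norm on matrices (Muon)}.
  \end{cases}
\]
```

In practice Muon approximates the orthogonal factor U V^T with Newton-Schulz iterations rather than an explicit SVD. Each case is a minimizer of the inner product over the corresponding unit ball, so every update has unit norm and the step length is set entirely by \eta.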