Scaling-Law Analysis of SignSGD: From Feature-Space Linear Regression to LLM Pre-training
Abstract
Although adaptive gradient methods are widely used in deep learning, the mechanisms underlying their effectiveness in large-scale training remain poorly understood. In this work, we provide a scaling-law analysis of SignSGD, a minimal yet expressive optimizer that captures the core coordinate-wise adaptivity shared by more sophisticated adaptive methods. We consider feature-space linear regression with power-law spectra, a setting in which the training dynamics of SignSGD can be characterized precisely. In this setting, we derive explicit scaling laws that accurately describe the loss dynamics of SignSGD. By further analyzing the data-limited regime, we characterize the phase diagram of SignSGD training and quantify the advantage of SignSGD in data scaling. We also show that SignSGD admits a substantially larger critical batch size than SGD, so it benefits more from large-batch training. Finally, we systematically validate our theoretical predictions through large-scale LLM pre-training experiments, demonstrating that the scaling laws uncovered here extend beyond the controlled model and are predictive of practical training behavior.
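For readers unfamiliar with the setting, the following is a minimal sketch of SignSGD on a linear regression problem with a power-law spectrum. It is an illustration only, not the paper's code or exact model. Everything specific here is an assumption made for the example: independent Gaussian features whose variances follow i^(-alpha), an all-ones target vector, and the hypothetical helper names and hyperparameters. The one standard ingredient is the SignSGD update itself, which moves each coordinate by a fixed step in the direction opposite to the sign of its minibatch gradient.

```python
import random


def power_law_spectrum(dim, alpha):
    # Assumed spectrum: the i-th feature variance is i^(-alpha), i = 1..dim.
    return [(i + 1) ** (-alpha) for i in range(dim)]


def sign(v):
    # Returns +1, -1, or 0 (for an exactly zero gradient coordinate).
    return (v > 0) - (v < 0)


def population_loss(w, w_star, spectrum):
    # Excess risk for independent features: 0.5 * sum_i lambda_i * (w_i - w*_i)^2.
    return 0.5 * sum(lam * (a - b) ** 2 for lam, a, b in zip(spectrum, w, w_star))


def minibatch_gradient(w, w_star, spectrum, batch_size, rng):
    # Average squared-loss gradient over a batch of fresh noiseless samples.
    dim = len(w)
    grad = [0.0] * dim
    for _ in range(batch_size):
        x = [rng.gauss(0.0, lam ** 0.5) for lam in spectrum]
        residual = sum(xi * (wi - ti) for xi, wi, ti in zip(x, w, w_star))
        for i in range(dim):
            grad[i] += residual * x[i] / batch_size
    return grad


def signsgd(spectrum, w_star, lr, steps, batch_size, seed=0):
    # SignSGD: w <- w - lr * sign(g), applied coordinate-wise.
    rng = random.Random(seed)
    w = [0.0] * len(spectrum)
    losses = [population_loss(w, w_star, spectrum)]
    for _ in range(steps):
        g = minibatch_gradient(w, w_star, spectrum, batch_size, rng)
        w = [wi - lr * sign(gi) for wi, gi in zip(w, g)]
        losses.append(population_loss(w, w_star, spectrum))
    return losses


if __name__ == "__main__":
    spectrum = power_law_spectrum(dim=20, alpha=1.5)
    losses = signsgd(spectrum, [1.0] * 20, lr=0.01, steps=200, batch_size=16)
    print(losses[0], losses[-1])
```

Because the update uses only the sign of each gradient coordinate, every coordinate moves by the same step size regardless of its eigenvalue, which is the coordinate-wise adaptivity the abstract refers to. Rerunning the sketch with different values of batch_size gives a rough feel for how minibatch noise affects the sign estimates.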