Does LLM Pre-Training Typically Occur at the Edge of Stability?
Yuhang Cai ⋅ Haofeng Huang ⋅ Haodong Wen ⋅ Deyi Liu ⋅ Yiyuan Ma ⋅ Kaifeng Lyu
Abstract
Quadratic approximations are a common lens for neural network optimization, but recent evidence challenges their predictive validity. In full-batch gradient descent with learning rate (LR) $\eta$, Cohen et al. (2021) observed the Edge of Stability (EoS), where the largest Hessian eigenvalue concentrates near $2/\eta$, in tension with classical stability conditions. In this work, we revisit the fidelity of the quadratic approximation as a model of neural network training dynamics, with particular focus on its failure modes in LLM training. We first identify and decouple a distinct failure mechanism of the quadratic approximation that holds regardless of the LR choice: persistent negative curvature during training, which we term the *Edge of Convexity* (EoC). Building on this decoupling from EoC, we then extend the definition of EoS to large-scale stochastic training with adaptive optimizers. Across LLM pretraining runs with model sizes up to $1.7$B, we find: (1) EoC is always observed. (2) EoS is also prevalent but not universal; it disappears when the LR becomes sufficiently small (e.g., after decay) or when the batch size falls below a critical threshold that is linearly related to the critical batch size. Together, these findings characterize when and how quadratic approximations fail and serve as foundations for future work on understanding the training dynamics of modern neural networks.
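As a point of reference for the classical stability condition mentioned above, the following is a minimal sketch, not the paper's method, of gradient descent on a one-dimensional quadratic $f(x) = \tfrac{1}{2}\lambda x^2$. The update is $x \leftarrow (1 - \eta\lambda)x$, so the iterates shrink only when $0 < \lambda < 2/\eta$; for negative $\lambda$ they grow for any positive $\eta$, which illustrates why negative curvature is a failure mode that does not depend on the LR. The function name `gd_on_quadratic` and all numeric values are assumptions chosen for illustration.

```python
def gd_on_quadratic(curvature, lr, steps=50, x0=1.0):
    """Run gradient descent on f(x) = 0.5 * curvature * x**2.

    Returns |x| after `steps` updates. Hypothetical helper for illustration.
    """
    x = x0
    for _ in range(steps):
        grad = curvature * x      # derivative of 0.5 * curvature * x**2
        x = x - lr * grad         # equivalent to x *= (1 - lr * curvature)
    return abs(x)


lr = 0.1                          # classical threshold is 2 / lr = 20

# Curvature just below 2 / lr: |1 - lr * curvature| = 0.9, iterates shrink.
below = gd_on_quadratic(curvature=19.0, lr=lr)

# Curvature just above 2 / lr: |1 - lr * curvature| = 1.1, iterates grow.
above = gd_on_quadratic(curvature=21.0, lr=lr)

# Negative curvature: |1 - lr * curvature| > 1 for every positive lr,
# so the iterates grow even when the LR is made very small.
negative = gd_on_quadratic(curvature=-1.0, lr=lr)
negative_small_lr = gd_on_quadratic(curvature=-1.0, lr=1e-3)

print(f"below threshold:      |x| = {below:.3e}")
print(f"above threshold:      |x| = {above:.3e}")
print(f"negative curvature:   |x| = {negative:.3e}")
print(f"negative, small lr:   |x| = {negative_small_lr:.3e}")
```

This toy model covers only the exact quadratic case with full-batch gradient descent; the stochastic, adaptive-optimizer setting studied in the paper requires the extended definitions the authors describe.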