Poster in Workshop: Bridging the Gap Between Practice and Theory in Deep Learning
Why do Learning Rates Transfer? Reconciling Optimization and Scaling Limits for Deep Learning
Alexandru Meterez · Lorenzo Noci · Thomas Hofmann · Antonio Orvieto
Recently, there has been growing evidence that when the width and depth of a neural network are scaled toward the so-called rich feature learning limit (μP and its depth extension), some hyperparameters, such as the learning rate, transfer from small to very large models, thus reducing the cost of hyperparameter tuning. From an optimization perspective, this phenomenon is puzzling, as it implies that the loss landscape is remarkably consistent across very different model sizes. In this work, we find empirical evidence that learning rate transfer can be attributed to the fact that under μP and its depth extension, the largest eigenvalue of the training loss Hessian (i.e., the sharpness) is largely independent of the width and depth of the network throughout training. Moreover, the sharpness dynamics remain remarkably consistent across widths for a long period of training, an important ingredient for learning rate transfer. In contrast, we show that in the NTK regime, the network's lack of feature learning causes the sharpness to follow very different dynamics at different scales, thus preventing learning rate transfer. We corroborate our claims with a substantial suite of experiments covering a wide range of datasets and architectures: from ResNets and Vision Transformers trained on benchmark vision datasets to Transformer-based language models trained on WikiText.
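The sharpness referred to above, the largest eigenvalue of the training loss Hessian, is typically too expensive to obtain by forming the Hessian explicitly; in practice it is estimated with Hessian-vector products (e.g., via automatic differentiation) combined with power iteration. The minimal NumPy sketch below illustrates the power-iteration step on a toy quadratic loss whose Hessian is known in closed form; the function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def sharpness(hvp, dim, iters=100, seed=0):
    """Estimate the largest Hessian eigenvalue (sharpness) by power
    iteration, using only a Hessian-vector product oracle `hvp`."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)          # random unit starting direction
    for _ in range(iters):
        hv = hvp(v)                 # one Hessian-vector product per step
        v = hv / np.linalg.norm(hv)
    return v @ hvp(v)               # Rayleigh quotient at convergence

# Toy quadratic loss L(w) = 0.5 * w^T A w, whose Hessian is simply A,
# so the true sharpness is the largest diagonal entry, 3.0.
A = np.diag([3.0, 1.0, 0.5])
est = sharpness(lambda v: A @ v, dim=3)  # → approximately 3.0
```

For a neural network, the `hvp` oracle would instead be supplied by an autodiff framework (Pearlmutter's trick), so the cost per iteration is a constant number of backward passes rather than a full Hessian.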