Optimal learning rate scaling depends on data in deep scalar linear networks
Yedi Zhang ⋅ Peter Latham ⋅ Leena Chennuru Vankadara ⋅ Andrew Saxe
Abstract
We study the gradient descent dynamics of deep scalar linear networks, which admit exact time-course solutions for any integer depth. We show that even in this minimal model, the optimal depth-wise learning rate scaling depends on the data, whereas data-agnostic scaling rules fail to transfer across depths. Under the data-dependent optimal scaling, the learning dynamics are independent of the data and only weakly dependent on depth, resulting in a constant linear convergence rate across all depths, including the infinite-depth limit. We further show similar data-dependent effects in deep scalar linear networks with residual connections.
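To make the setting concrete, the following is a minimal sketch of gradient descent on a deep scalar linear network, written under assumptions that are not stated in the abstract: a squared-error loss, scalar data summarized by the second moments `sigma_xx` and `sigma_yx`, and identical initial weights `w0`. The function name `train` and all parameter values are illustrative. The sketch uses a plain, depth-independent learning rate and does not implement the paper's optimal scaling.

```python
import math


def train(depth, lr, sigma_yx, sigma_xx, w0, steps):
    """Gradient descent on f(x) = (w_1 * ... * w_depth) * x with squared loss.

    Assumption (not from the abstract): the data enter only through
    sigma_xx = E[x^2] and sigma_yx = E[y x]. The returned losses are the
    excess loss, i.e. the loss minus its minimum value.
    """
    weights = [w0] * depth
    target = sigma_yx / sigma_xx  # product of weights that minimizes the loss
    losses = []
    for _ in range(steps):
        product = math.prod(weights)
        losses.append(0.5 * sigma_xx * (product - target) ** 2)
        residual = sigma_yx - product * sigma_xx
        # dLoss/dw_l = -residual * (product of all the other weights)
        grads = [
            -residual * math.prod(weights[:l] + weights[l + 1:])
            for l in range(depth)
        ]
        # Update all weights simultaneously from the same gradients.
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights, losses


if __name__ == "__main__":
    # Same learning rate at several depths: count steps to a fixed excess loss.
    for depth in (1, 2, 4, 8):
        _, losses = train(depth, lr=0.01, sigma_yx=2.0, sigma_xx=1.0,
                          w0=1.0, steps=2000)
        hit = next((t for t, v in enumerate(losses) if v < 1e-8), None)
        print(f"depth={depth}: steps to excess loss < 1e-8: {hit}")
```

Running the script with other values of `sigma_yx` or `sigma_xx` is one way to explore how the number of steps at a fixed learning rate changes with both depth and data in this toy setup.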