Poster in Workshop: Scientific Methods for Understanding Deep Learning (Sci4DL)

Optimal learning rate scaling depends on data in deep scalar linear networks

Yedi Zhang ⋅ Peter Latham ⋅ Leena Chennuru Vankadara ⋅ Andrew Saxe


Abstract

We study the gradient descent dynamics of deep scalar linear networks, which enjoy exact time-course solutions for any integer depth. We show that even in this minimal model, the optimal depth-wise learning rate scaling depends on data, whereas data-agnostic scaling rules fail to transfer across depths. Under the data-dependent optimal scaling, the learning dynamics are independent of data and only weakly dependent on depth, resulting in a constant linear convergence rate across all depths, including the infinite-depth limit. We further show similar data-dependent effects in deep scalar linear networks with residual connections.
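To make the setting concrete, here is a minimal sketch of gradient descent on a deep scalar linear network. It is not the authors' code and does not implement the scaling rule from the paper. It assumes a loss of the form 0.5 * (target - w_1 * ... * w_depth)^2, where the scalar `target` stands in for the data, and identical initial weights. The names `train_scalar_linear` and `steps_to_tolerance` are hypothetical helpers introduced for illustration.

```python
import math


def train_scalar_linear(depth, target, lr, init=1.0, steps=1000):
    """Gradient descent on 0.5 * (target - w_1 * ... * w_depth)**2.

    Assumption: `target` summarizes the data as a single scalar, and all
    weights start at the same value `init`.
    """
    weights = [init] * depth
    losses = []
    for _ in range(steps):
        product = math.prod(weights)
        error = target - product
        losses.append(0.5 * error ** 2)
        # d(loss)/d(w_i) = -error * (product of all other weights)
        grads = [
            -error * math.prod(weights[:i] + weights[i + 1:])
            for i in range(depth)
        ]
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights, losses


def steps_to_tolerance(losses, tol=1e-8):
    """Index of the first step whose loss is below `tol`, or None."""
    for step, loss in enumerate(losses):
        if loss < tol:
            return step
    return None


# Hold the learning rate fixed and vary depth and target to see how the
# number of steps to convergence changes with both.
for depth in (1, 2, 4):
    for target in (1.5, 2.0):
        _, losses = train_scalar_linear(depth, target, lr=0.05)
        print(depth, target, steps_to_tolerance(losses))
```

Running the loop with a single fixed learning rate is one way to check how convergence speed in this toy setup varies with both depth and target, which is the kind of dependence the abstract describes.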
