ICLR Poster The Optimization Landscape of SGD Across the Feature Learning Strength

Poster

The Optimization Landscape of SGD Across the Feature Learning Strength

Alexander Atanasov · Alexandru Meterez · James Simon · Cengiz Pehlevan

Hall 3 + Hall 2B #610

[ Abstract ]

Wed 23 Apr 7 p.m. PDT — 9:30 p.m. PDT

Abstract: We consider neural networks (NNs) where the final layer is down-scaled by a fixed hyperparameter

$\gamma$ . Recent work has identified

$\gamma$ as controlling the strength of feature learning.As

$\gamma$ increases, network evolution changes from "lazy" kernel dynamics to "rich" feature-learning dynamics, with a host of associated benefits including improved performance on common tasks.In this work, we conduct a thorough empirical investigation of the effect of scaling

$\gamma$ across a variety of models and datasets in the online training setting.We first examine the interaction of

$\gamma$ with the learning rate

$\eta$ , identifying several scaling regimes in the

$\gamma$ -

$\eta$ plane which we explain theoretically using a simple model.We find that the optimal learning rate

$\eta^*$ scales non-trivially with

$\gamma$ . In particular,

$\eta^* \propto \gamma^2$ when

$\gamma \ll 1$ and

$\eta^* \propto \gamma^{2/L}$ when

$\gamma \gg 1$ for a feed-forward network of depth

$L$ .Using this optimal learning rate scaling, we proceed with an empirical study of the under-explored

ultra-rich''

$\gamma \gg 1$ regime.We find that networks in this regime display characteristic loss curves, starting with a long plateau followed by a drop-off, sometimes followed by one or more additional staircase steps.We find networks of different large

$\gamma$ values optimize along similar trajectories up to a reparameterization of time.We further find that optimal online performance is often found at large

$\gamma$ and could be missed if this hyperparameter is not tuned.Our findings indicate that analytical study of the large-

$\gamma$ limit may yield useful insights into the dynamics of representation learning in performant models.

Live content is unavailable. Log in and register to view live content