

Poster

The Optimization Landscape of SGD Across the Feature Learning Strength

Alexander Atanasov · Alexandru Meterez · James Simon · Cengiz Pehlevan

Hall 3 + Hall 2B #610
Wed 23 Apr 7 p.m. PDT — 9:30 p.m. PDT

Abstract: We consider neural networks (NNs) where the final layer is down-scaled by a fixed hyperparameter γ. Recent work has identified γ as controlling the strength of feature learning. As γ increases, network evolution changes from "lazy" kernel dynamics to "rich" feature-learning dynamics, with a host of associated benefits including improved performance on common tasks. In this work, we conduct a thorough empirical investigation of the effect of scaling γ across a variety of models and datasets in the online training setting. We first examine the interaction of γ with the learning rate η, identifying several scaling regimes in the γ-η plane which we explain theoretically using a simple model. We find that the optimal learning rate η* scales non-trivially with γ. In particular, η* ∝ γ² when γ ≪ 1 and η* ∝ γ²/L when γ ≫ 1 for a feed-forward network of depth L. Using this optimal learning-rate scaling, we proceed with an empirical study of the under-explored "ultra-rich" γ ≫ 1 regime. We find that networks in this regime display characteristic loss curves, starting with a long plateau followed by a drop-off, sometimes followed by one or more additional staircase steps. We find that networks of different large γ values optimize along similar trajectories up to a reparameterization of time. We further find that optimal online performance is often found at large γ and could be missed if this hyperparameter is not tuned. Our findings indicate that analytical study of the large-γ limit may yield useful insights into the dynamics of representation learning in performant models.
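Below is a minimal sketch, in PyTorch, of the γ-parameterization and learning-rate scaling described in the abstract. The architecture, widths, base learning rate, and the crossover at γ = 1 are illustrative assumptions for the purpose of the sketch, not the authors' exact experimental setup.

```python
import torch
import torch.nn as nn


class GammaScaledMLP(nn.Module):
    """Feed-forward network whose final-layer output is down-scaled by a fixed gamma.

    Dividing the output by a larger gamma forces the hidden representations to move
    more during training (richer feature learning); a small gamma recovers lazy,
    kernel-like dynamics. Widths and activations here are illustrative.
    """

    def __init__(self, in_dim: int, width: int, depth: int, gamma: float):
        super().__init__()
        layers = [nn.Linear(in_dim, width), nn.ReLU()]
        for _ in range(depth - 2):
            layers += [nn.Linear(width, width), nn.ReLU()]
        layers += [nn.Linear(width, 1)]
        self.body = nn.Sequential(*layers)
        self.gamma = gamma
        self.depth = depth

    def forward(self, x):
        # Down-scale the final-layer output by gamma.
        return self.body(x) / self.gamma


def scaled_learning_rate(base_lr: float, gamma: float, depth: int) -> float:
    """Heuristic learning-rate choice following the scalings quoted in the abstract:
    eta* ∝ gamma^2 for gamma << 1 and eta* ∝ gamma^2 / L for gamma >> 1.
    The base_lr constant and the switch at gamma = 1 are assumptions of this sketch."""
    if gamma <= 1.0:
        return base_lr * gamma ** 2
    return base_lr * gamma ** 2 / depth


# Example usage with an assumed width, depth, and base learning rate.
model = GammaScaledMLP(in_dim=32, width=256, depth=4, gamma=10.0)
lr = scaled_learning_rate(base_lr=0.1, gamma=model.gamma, depth=model.depth)
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
```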
