Poster
The Optimization Landscape of SGD Across the Feature Learning Strength
Alexander Atanasov · Alexandru Meterez · James Simon · Cengiz Pehlevan
Hall 3 + Hall 2B #610
Wed 23 Apr 7 p.m. PDT — 9:30 p.m. PDT
Abstract:
We consider neural networks (NNs) where the final layer is down-scaled by a fixed hyperparameter γ. Recent work has identified γ as controlling the strength of feature learning. As γ increases, network evolution changes from "lazy" kernel dynamics to "rich" feature-learning dynamics, with a host of associated benefits including improved performance on common tasks. In this work, we conduct a thorough empirical investigation of the effect of scaling γ across a variety of models and datasets in the online training setting. We first examine the interaction of γ with the learning rate η, identifying several scaling regimes in the γ-η plane which we explain theoretically using a simple model. We find that the optimal learning rate η* scales non-trivially with γ. In particular, η* ∝ γ^2 when γ ≪ 1 and η* ∝ γ^{2/L} when γ ≫ 1 for a feed-forward network of depth L. Using this optimal learning rate scaling, we proceed with an empirical study of the under-explored "ultra-rich" γ ≫ 1 regime. We find that networks in this regime display characteristic loss curves, starting with a long plateau followed by a drop-off, sometimes followed by one or more additional staircase steps. We find that networks of different large γ values optimize along similar trajectories up to a reparameterization of time. We further find that optimal online performance is often attained at large γ and could be missed if this hyperparameter is not tuned. Our findings indicate that analytical study of the large-γ limit may yield useful insights into the dynamics of representation learning in performant models.
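The setup and learning-rate rule described above can be summarized in a short sketch. The snippet below is a hypothetical PyTorch illustration, not the paper's code: the class GammaScaledMLP, the helper scaled_lr, and all hyperparameter values are assumptions. It shows the network output being divided by γ and the step size chosen using the η* ∝ γ^2 (γ ≪ 1) versus η* ∝ γ^{2/L} (γ ≫ 1) scaling quoted in the abstract.

```python
# Minimal sketch (assumed PyTorch setup; GammaScaledMLP, scaled_lr, and the
# hyperparameter values below are illustrative, not the authors' code).
import torch
import torch.nn as nn


class GammaScaledMLP(nn.Module):
    """Depth-L feed-forward net whose output is down-scaled by a fixed gamma.

    Small gamma gives "lazy" kernel-like dynamics; large gamma forces the
    hidden features to move, giving "rich" feature-learning dynamics.
    """

    def __init__(self, d_in, width, d_out, depth, gamma):
        super().__init__()
        self.gamma = gamma
        dims = [d_in] + [width] * (depth - 1)
        layers = []
        for i in range(depth - 1):
            layers += [nn.Linear(dims[i], dims[i + 1]), nn.ReLU()]
        layers.append(nn.Linear(dims[-1], d_out))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # Final-layer output divided by gamma (the feature-learning strength).
        return self.net(x) / self.gamma


def scaled_lr(eta_ref, gamma, depth):
    """Learning-rate scaling stated in the abstract: eta* ~ gamma^2 for
    gamma << 1 and eta* ~ gamma^(2/L) for gamma >> 1. eta_ref is an assumed
    reference learning rate at gamma = 1."""
    exponent = 2.0 if gamma <= 1.0 else 2.0 / depth
    return eta_ref * gamma ** exponent


# Example: an "ultra-rich" network (gamma >> 1) trained with SGD at the scaled rate.
model = GammaScaledMLP(d_in=32, width=256, d_out=1, depth=3, gamma=10.0)
optimizer = torch.optim.SGD(model.parameters(), lr=scaled_lr(0.1, gamma=10.0, depth=3))
```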