Poster
Adam Exploits ℓ∞-geometry of Loss Landscape via Coordinate-wise Adaptivity
Shuo Xie · Mohamad Amin Mohamadi · Zhiyuan Li
Hall 3 + Hall 2B #426
Thu 24 Apr 7 p.m. PDT
— 9:30 p.m. PDT
Abstract:
Adam outperforms SGD when training language models. Yet this advantage is not well understood theoretically: previous convergence analyses for Adam and SGD mainly focus on the number of steps T and are already minimax-optimal in non-convex cases, both achieving a rate of Õ(T^(-1/4)). In this work, we argue that exploiting the benign ℓ∞-geometry of the loss landscape is the key advantage of Adam over SGD. More specifically, we give a new convergence analysis for Adam under the novel assumption that the loss is smooth under ℓ∞-geometry rather than the more common ℓ2-geometry, which yields a much better empirical smoothness constant for GPT-2 and ResNet models. Our experiments confirm that Adam performs much worse when the favorable ℓ∞-geometry is changed, while SGD provably remains unaffected. We also extend the convergence analysis to blockwise Adam under novel blockwise smoothness assumptions.
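For readers unfamiliar with the "coordinate-wise adaptivity" referenced in the title, the sketch below shows the standard Adam update (Kingma & Ba, 2015) next to plain SGD. It is an illustrative reference implementation, not code from the paper; the toy quadratic loss and its per-coordinate curvature values are hypothetical, and the ℓ∞-smoothness analysis itself is not reproduced here.

```python
import numpy as np

def sgd_step(params, grad, lr=1e-2):
    """Plain SGD: one scalar step size shared by every coordinate."""
    return params - lr * grad

def adam_step(params, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam step (Kingma & Ba, 2015).

    Each coordinate i is rescaled by 1 / (sqrt(v_hat[i]) + eps), so the
    effective step size adapts per coordinate -- the coordinate-wise
    adaptivity the paper argues lets Adam exploit ℓ∞-geometry.
    """
    m = beta1 * m + (1 - beta1) * grad        # first-moment EMA
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment EMA
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    return params - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy comparison on L(x) = 0.5 * sum(h * x**2) with very different
# per-coordinate curvatures (hypothetical values for illustration).
h = np.array([100.0, 1.0, 0.01])
x_sgd = np.ones(3)
x_adam = np.ones(3)
m = np.zeros(3)
v = np.zeros(3)
for t in range(1, 201):
    x_sgd = sgd_step(x_sgd, h * x_sgd, lr=1e-2)
    x_adam, m, v = adam_step(x_adam, h * x_adam, m, v, t, lr=1e-1)
print("SGD :", x_sgd)   # the low-curvature coordinate barely moves
print("Adam:", x_adam)  # all coordinates make comparable progress
```

In this toy setting, SGD's single step size is capped by the stiffest coordinate, while Adam's per-coordinate rescaling makes comparable progress in every direction, which is the intuition behind analyzing it under ℓ∞-geometry rather than ℓ2-geometry.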