

Poster in Workshop: Bridging the Gap Between Practice and Theory in Deep Learning

Implicit Bias of AdamW: $\ell_\infty$-Norm Constrained Optimization

Shuo Xie · Zhiyuan Li


Abstract: Adam with decoupled weight decay, a.k.a. AdamW, is widely acclaimed for its superior performance in language modeling tasks, surpassing Adam with $\ell_2$ regularization in terms of both generalization and optimization. The theoretical underpinnings of this improvement, however, remain elusive. One challenge is the ambiguity surrounding whether AdamW optimizes a specific objective, unlike its $\ell_2$-regularized counterpart, which clearly targets an $\ell_2$-regularized loss. In this work, we make progress toward understanding the benefit of AdamW by showing that it implicitly performs constrained optimization. More concretely, we show in the full-batch setting that if AdamW converges with any non-increasing learning rate schedule whose partial sums diverge, it must converge to a KKT point of the original loss under the constraint that the $\ell_\infty$ norm of the parameters is bounded by the inverse of the weight decay factor. This result builds on the observation that Adam can be viewed as a smoothed version of SignGD, which is normalized steepest descent with respect to the $\ell_\infty$ norm, and on a surprising connection between normalized steepest descent with weight decay and Frank-Wolfe.
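
The following is a minimal numerical sketch (not the paper's proof) of the abstract's key observation: SignGD with decoupled weight decay, the $\ell_\infty$ normalized-steepest-descent analogue of AdamW, keeps its iterates inside the $\ell_\infty$ ball of radius $1/\lambda$ and can be rewritten as a Frank-Wolfe step on that ball. The quadratic loss, step-size schedule, and variable names (`grad`, `eta`, `lam`) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Illustrative sketch: SignGD with decoupled weight decay on an assumed
# quadratic loss 0.5 * ||A x - b||^2. The update
#   x <- (1 - eta*lam) * x - eta * sign(g)
# equals the Frank-Wolfe step x <- (1 - eta*lam) * x + eta*lam * s with
#   s = -(1/lam) * sign(g) = argmin_{||s||_inf <= 1/lam} <g, s>,
# so iterates stay in the l_inf ball of radius 1/lam.

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

def grad(x):
    # Gradient of the assumed quadratic loss.
    return A.T @ (A @ x - b)

lam = 2.0          # decoupled weight decay factor
x = np.zeros(5)    # start inside {x : ||x||_inf <= 1/lam}

for t in range(1, 5001):
    eta = 0.1 / t  # non-increasing schedule with divergent partial sums
    g = grad(x)
    x = (1 - eta * lam) * x - eta * np.sign(g)

print("||x||_inf      :", np.max(np.abs(x)))  # never exceeds 1/lam
print("constraint 1/lam:", 1 / lam)
```

In this sketch the effective Frank-Wolfe step size is $\eta_t \lambda$, which is why the weight decay factor $\lambda$ sets the radius of the implicit $\ell_\infty$ constraint.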
