

Poster
in
Workshop: Bridging the Gap Between Practice and Theory in Deep Learning

Implicit Bias of AdamW: ℓ∞-Norm Constrained Optimization

Shuo Xie · Zhiyuan Li


Abstract: Adam with decoupled weight decay, a.k.a. AdamW, is widely acclaimed for its superior performance in language modeling tasks, surpassing Adam with ℓ₂ regularization in terms of generalization and optimization. The theoretical underpinnings of this improvement, however, remain elusive. One of the challenges is the ambiguity surrounding whether AdamW optimizes a specific objective, unlike its ℓ₂-regularization counterpart, which clearly targets an ℓ₂-regularized loss. In this work, we make progress toward understanding the benefit of AdamW by showing that it implicitly performs constrained optimization. More concretely, we show in the full-batch setting that, should AdamW converge with any non-increasing learning rate schedule whose partial sum diverges, it must converge to a KKT point of the original loss under the constraint that the ℓ∞ norm of the parameter is bounded by the inverse of the weight decay factor. This result is built on the observation that Adam can be viewed as a smoothed version of SignGD, which is normalized steepest descent with respect to the ℓ∞ norm, and on a surprising connection between normalized steepest descent with weight decay and Frank-Wolfe.
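The constraint described in the abstract can be illustrated with a minimal sketch of SignGD with decoupled weight decay (the non-smoothed limit of AdamW discussed above). This is an illustrative toy, not the paper's construction: the loss, step sizes, and variable names below are assumptions chosen only to make the ℓ∞ bound visible. With weight decay factor `wd`, the iterates remain inside the ℓ∞ ball of radius `1/wd`, matching the claimed constraint.

```python
import numpy as np

def signgd_wd_step(x, grad, lr, wd):
    # Normalized steepest descent w.r.t. the l-inf norm (SignGD)
    # with decoupled weight decay:
    #   x <- (1 - lr*wd) * x - lr * sign(grad)
    # If |x_i| <= 1/wd and lr*wd <= 1, the update keeps |x_i| <= 1/wd,
    # so the iterates never leave the l-inf ball of radius 1/wd.
    return (1 - lr * wd) * x - lr * np.sign(grad)

# Hypothetical toy quadratic loss whose unconstrained minimizer (3, -3)
# lies well outside the ball, so the constraint becomes active.
def grad_f(x):
    return 2.0 * (x - np.array([3.0, -3.0]))

lr, wd = 0.1, 2.0          # weight decay factor 2 -> l-inf bound 1/2
x = np.zeros(2)
for _ in range(200):
    x = signgd_wd_step(x, grad_f(x), lr, wd)

# The iterates approach the boundary of the ball from inside:
# np.max(np.abs(x)) stays <= 1/wd = 0.5 throughout.
print(np.max(np.abs(x)))
```

In this toy run, each coordinate settles on the boundary value ±1/wd, consistent with the abstract's picture of convergence to a KKT point of the ℓ∞-constrained problem rather than to the unconstrained minimizer.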
