Poster in Workshop: Bridging the Gap Between Practice and Theory in Deep Learning
Implicit Bias of AdamW: ℓ∞-Norm Constrained Optimization
Shuo Xie · Zhiyuan Li
Abstract:
Adam with decoupled weight decay, a.k.a. AdamW, is widely acclaimed for its superior performance in language modeling tasks, surpassing Adam with ℓ2 regularization in terms of generalization and optimization. The theoretical underpinnings of this improvement, however, remain elusive. One of the challenges is the ambiguity surrounding whether AdamW optimizes a specific objective, unlike its ℓ2-regularization counterpart, which clearly targets an ℓ2-regularized loss. In this work, we make progress toward understanding the benefit of AdamW by showing that it implicitly performs constrained optimization. More concretely, we show in the full-batch setting that, should AdamW converge with any non-increasing learning rate schedule whose partial sum diverges, it must converge to a KKT point of the original loss under the constraint that the ℓ∞ norm of the parameters is bounded by the inverse of the weight decay factor. This result is built on the observation that Adam can be viewed as a smoothed version of SignGD, which is normalized steepest descent with respect to the ℓ∞ norm, and on a surprising connection between normalized steepest descent with weight decay and Frank-Wolfe.
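The sketch below is a minimal, illustrative comparison (not the paper's implementation) of the AdamW update and SignGD with decoupled weight decay, the "hard" update that AdamW smooths over. The hyperparameter names (lr, wd, beta1, beta2, eps) follow common AdamW conventions and are assumptions here; the final comment records the fixed-point intuition behind the ℓ∞-norm constraint described in the abstract.

```python
import numpy as np

def adamw_step(x, grad, m, v, t, lr=1e-3, wd=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One full-batch AdamW step with decoupled weight decay (illustrative sketch)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)            # bias-corrected first moment
    v_hat = v / (1 - beta2**t)            # bias-corrected second moment
    x = x - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * x)
    return x, m, v

def signgd_wd_step(x, grad, lr=1e-3, wd=0.1):
    """SignGD with decoupled weight decay: normalized steepest descent w.r.t. the
    l_inf norm plus the same decay term. Setting beta1 = beta2 = 0 and eps = 0 in
    adamw_step recovers this update."""
    return x - lr * (np.sign(grad) + wd * x)

# Fixed-point intuition: if the iterates stop moving, sign(grad) + wd * x = 0,
# so every coordinate satisfies |x_i| = 1 / wd, i.e. the parameters sit on the
# boundary of the constraint ||x||_inf <= 1 / wd appearing in the paper's KKT
# characterization (with wd playing the role of the weight decay factor).
```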