

Poster
in
Workshop: Bridging the Gap Between Practice and Theory in Deep Learning

Implicit Bias of AdamW: ℓ∞-Norm Constrained Optimization

Shuo Xie · Zhiyuan Li


Abstract: Adam with decoupled weight decay, a.k.a. AdamW, is widely acclaimed for its superior performance in language modeling tasks, surpassing Adam with ℓ₂ regularization in terms of generalization and optimization. The theoretical underpinnings of this improvement, however, remain elusive. One of the challenges is the ambiguity surrounding whether AdamW optimizes a specific objective, unlike its ℓ₂-regularization counterpart, which clearly targets an ℓ₂-regularized loss. In this work, we make progress toward understanding the benefit of AdamW by showing that it implicitly performs constrained optimization. More concretely, we show in the full-batch setting that, should AdamW converge with any non-increasing learning rate schedule whose partial sum diverges, it must converge to a KKT point of the original loss under the constraint that the ℓ∞ norm of the parameter is bounded by the inverse of the weight decay factor. This result is built on the observation that Adam can be viewed as a smoothed version of SignGD, which is normalized steepest descent with respect to the ℓ∞ norm, and on a surprising connection between normalized steepest descent with weight decay and Frank-Wolfe.
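The constraint described in the abstract can be illustrated with a minimal sketch of SignGD with decoupled weight decay (the non-smoothed limit of AdamW discussed above). This is an illustrative toy, not the paper's construction: the loss, step sizes, and variable names below are assumptions chosen only to make the ℓ∞ bound visible. With weight decay factor `wd`, the iterates remain inside the ℓ∞ ball of radius `1/wd`, matching the claimed constraint.

```python
import numpy as np

def signgd_wd_step(x, grad, lr, wd):
    # Normalized steepest descent w.r.t. the l-inf norm (SignGD)
    # with decoupled weight decay:
    #   x <- (1 - lr*wd) * x - lr * sign(grad)
    # If |x_i| <= 1/wd and lr*wd <= 1, the update keeps |x_i| <= 1/wd,
    # so the iterates never leave the l-inf ball of radius 1/wd.
    return (1 - lr * wd) * x - lr * np.sign(grad)

# Hypothetical toy quadratic loss whose unconstrained minimizer (3, -3)
# lies well outside the ball, so the constraint becomes active.
def grad_f(x):
    return 2.0 * (x - np.array([3.0, -3.0]))

lr, wd = 0.1, 2.0          # weight decay factor 2 -> l-inf bound 1/2
x = np.zeros(2)
for _ in range(200):
    x = signgd_wd_step(x, grad_f(x), lr, wd)

# The iterates approach the boundary of the ball from inside:
# np.max(np.abs(x)) stays <= 1/wd = 0.5 throughout.
print(np.max(np.abs(x)))
```

In this toy run, each coordinate settles on the boundary value ±1/wd, consistent with the abstract's picture of convergence to a KKT point of the ℓ∞-constrained problem rather than to the unconstrained minimizer.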
