

Poster

Taming Transformer Without Using Learning Rate Warmup

Xianbiao Qi · Yelin He · Jiaquan Ye · Chun-Guang Li · Bojia Zi · Xili Dai · Qin Zou · Rong Xiao

Hall 3 + Hall 2B #372
[ Project Page ]
Fri 25 Apr 7 p.m. PDT — 9:30 p.m. PDT

Abstract: Scaling Transformer to a large scale without technical tricks such as learning rate warmup and a markedly lower learning rate is an extremely challenging task, and it is gaining increasing attention. In this paper, we provide a theoretical analysis of Transformer training and reveal a key problem behind the model crash phenomenon during training, i.e., the spectral energy concentration of $W_q^{\top} W_k$ (where $W_q$ and $W_k$ are the projection matrices for query and key in Transformer), which is the cause of a malignant entropy collapse. To remedy this problem, motivated by Weyl's Inequality, we present a novel optimization strategy: making the weight updates in successive steps smooth. That is, if the ratio $\sigma_1(W_t)/\sigma_1(W_{t-1})$ is larger than a threshold, where $W_t$ is the updating quantity in step $t$, we automatically bound the learning rate to a weighted multiple of $\sigma_1(W_{t-1})/\sigma_1(W_t)$. Our optimization strategy prevents the spectral energy from rapidly concentrating in only a few directions, and thus avoids the malignant entropy collapse that triggers the model crash. We conduct extensive experiments with ViT, Swin Transformer, and GPT, showing that our optimization strategy can effectively and stably train these Transformer models without using learning rate warmup.
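As a rough illustration of the bounding rule described in the abstract, the minimal sketch below (our own assumption, not the authors' released code; the function name, threshold, and weighting factor are hypothetical) compares the top singular values of two successive update matrices and rescales the learning rate when their ratio exceeds a threshold.

```python
import torch

def bound_learning_rate(base_lr: float,
                        update_prev: torch.Tensor,
                        update_curr: torch.Tensor,
                        threshold: float = 1.0,
                        weight: float = 1.0) -> float:
    """Rescale the learning rate when the current update's top singular value
    grows too quickly relative to the previous update's, keeping successive
    weight updates smooth (in the spirit of Weyl's Inequality)."""
    # sigma_1: largest singular value (spectral norm) of each update matrix.
    sigma_prev = torch.linalg.matrix_norm(update_prev, ord=2)
    sigma_curr = torch.linalg.matrix_norm(update_curr, ord=2)

    # If sigma_1(W_t) / sigma_1(W_{t-1}) exceeds the threshold, bound the
    # learning rate to a weighted multiple of sigma_1(W_{t-1}) / sigma_1(W_t).
    if sigma_curr / (sigma_prev + 1e-12) > threshold:
        return base_lr * weight * float(sigma_prev / (sigma_curr + 1e-12))
    return base_lr
```

In practice such a rule would be applied per projection matrix inside the optimizer loop; the exact placement, threshold, and weighting are choices the paper specifies and this sketch does not.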
