Layer Normalization Improves Length Generalization
Ruining Li · Gabrijel Boduljak · Jinghao Zhou
Abstract
It is a widely known issue that Transformers, when trained on shorter sequences, fail to generalize robustly to longer ones at test time. This raises the question of whether Transformer models are genuine _reasoning_ engines, despite their impressive abilities in mathematical problem solving and code synthesis. In this paper, we offer a _vanishing variance_ perspective on this issue. To the best of our knowledge, we are the first to demonstrate that even for today's frontier models, a longer sequence length results in a decrease in variance in the output of the multi-head attention modules. On the $\operatorname{argmax}$ retrieval and dictionary lookup tasks, our experiments show that applying layer normalization after the attention output leads to significantly better length generalization. Our analyses attribute this improvement to a reduction---though not a complete elimination---of the distribution shift caused by vanishing variance.
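As a rough illustration of the architectural change the abstract describes, the sketch below shows one way a LayerNorm could be applied to the multi-head attention output inside a pre-norm Transformer block in PyTorch. The class name, layer arrangement, and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class AttentionBlockWithOutputLN(nn.Module):
    """Pre-norm attention block with an extra LayerNorm on the attention output.

    Illustrative sketch of the modification described in the abstract;
    names and structure are assumptions, not the paper's exact code.
    """

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.pre_ln = nn.LayerNorm(d_model)   # standard pre-norm on the block input
        self.out_ln = nn.LayerNorm(d_model)   # extra LayerNorm on the attention output

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.pre_ln(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        # Normalizing the attention output rescales its per-token variance,
        # counteracting the variance shrinkage that grows with sequence length.
        return x + self.out_ln(attn_out)
```

Because LayerNorm standardizes each token's features to unit variance before the residual addition, the statistics fed to downstream layers stay roughly constant as the sequence grows, which is the mechanism the abstract credits for the improved length generalization.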