Poster in Workshop: SCOPE: SCALABLE OPTIMIZATION FOR EFFICIENT AND ADAPTIVE FOUNDATION MODELS
Layer Normalization Improves Length Generalization
Ruining Li · Gabrijel Boduljak · Jinghao Zhou
Keywords: [ Transformers ] [ attention ] [ long-context generalisation ]
Abstract:
It is a widely known issue that Transformers, when trained on shorter sequences, fail to generalize robustly to longer ones at test time. This raises the question of whether Transformer models are genuine _reasoning_ engines, despite their impressive abilities in mathematical problem solving and code synthesis. In this paper, we offer a _vanishing variance_ perspective on this issue. To the best of our knowledge, we are the first to demonstrate that even for today's frontier models, longer sequence lengths lead to a decrease in the variance of the multi-head attention outputs. On the $\operatorname{argmax}$ retrieval and dictionary lookup tasks, our experiments show that applying layer normalization after the attention output leads to significantly better length generalization. Our analyses attribute this improvement to a reduction---though not a complete elimination---of the distribution shift caused by vanishing variance.
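The sketch below (not the authors' code) illustrates the two ideas in the abstract under illustrative assumptions: a standard PyTorch multi-head attention module fed random token embeddings, whose output variance shrinks as the input sequence grows, and the proposed placement of a LayerNorm directly on the attention output, which keeps the output statistics stable across lengths. The layer sizes and random-input setup are assumptions for demonstration only.

```python
# Minimal sketch, assuming a standard multi-head self-attention module and
# random inputs as a stand-in for token embeddings (not the authors' code).
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n_heads = 64, 4  # illustrative sizes
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
post_norm = nn.LayerNorm(d_model)  # "layer normalization after the attention output"

for seq_len in (16, 64, 256, 1024):
    x = torch.randn(1, seq_len, d_model)   # random "tokens"
    out, _ = attn(x, x, x)                 # self-attention output
    # Raw output variance drops with sequence length (vanishing variance);
    # the post-attention LayerNorm keeps it roughly constant.
    print(f"len={seq_len:5d}  raw var={out.var().item():.4f}  "
          f"normed var={post_norm(out).var().item():.4f}")
```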