On the "Induction Bias" in Sequence Models
Abstract
Despite the remarkable practical success of transformer-based language models, recent work has raised concerns about their ability to perform state tracking, particularly under out-of-distribution (OOD) generalization such as length extrapolation. In this work, we shift attention to the in-distribution implications of these limitations. We empirically compare the data efficiency of transformers and recurrent neural networks (RNNs) and find that the amount of training data transformers require grows much more rapidly with state-space size and sequence length than it does for RNNs. Furthermore, we analyze the extent to which learned state-tracking mechanisms are shared across sequence lengths. We show that transformers exhibit negligible or even detrimental weight sharing across lengths, indicating that they learn length-specific solutions in isolation. In contrast, recurrent models exhibit effective amortized learning by sharing weights across lengths. Together, these results demonstrate that state tracking remains a fundamental challenge for transformers, even when training and evaluation distributions match.
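To make the two scaling axes concrete, the following is a minimal sketch of a toy state-tracking task, assuming addition modulo a fixed number of states as the underlying transition rule. This task is chosen here for illustration only and is not necessarily one of the tasks used in the experiments; the names `make_example`, `num_states`, and `seq_len` are hypothetical. Here `num_states` plays the role of state-space size and `seq_len` the role of sequence length.

```python
import random


def make_example(num_states, seq_len, rng):
    """Build one toy state-tracking example (illustrative assumption).

    Each input token is an integer in [0, num_states). The hidden state
    starts at 0 and is updated by modular addition, so the target at
    position t is the running sum of tokens 0..t modulo num_states.
    """
    inputs = [rng.randrange(num_states) for _ in range(seq_len)]
    states = []
    state = 0
    for token in inputs:
        # The same update rule applies at every position, regardless of
        # how long the sequence is.
        state = (state + token) % num_states
        states.append(state)
    return inputs, states


rng = random.Random(0)  # fixed seed so the example is reproducible
inputs, states = make_example(num_states=5, seq_len=8, rng=rng)
print(inputs)
print(states)
```

The update inside the loop is identical at every position, which is the kind of length-independent rule a recurrent model can in principle reuse across sequence lengths.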