Poster in Workshop: Bridging the Gap Between Practice and Theory in Deep Learning
Counting on Algorithmic Capacity: The Interplay between Mixing and Memorization in Toy Models of Transformers
Freya Behrens · Luca Biggio · Lenka Zdeborova
Transformer models have seen a formidable rise in the deep learning community. Their very flexible architecture, characterized by the interplay of attention and feed-forward modules, has made them the model of choice for many different problems, extending well beyond natural language processing. To elucidate the inductive bias of the model, a large body of research has focused on analyzing the attention mechanism in isolation, ignoring how this component interacts with the other elements of the architecture, most importantly the feed-forward modules. In this work, we take a step back and interpret Transformers as one instance of a broader class of models that alternate an arbitrary token-mixing mechanism with a feature transformation. Using a simple counting task as a prototypical problem, we study two members of this class, namely attention and a simple linear mixing model. Despite the apparent simplicity of the task under consideration, our analysis highlights a number of interesting properties of transformer-like models. First, the choice of the mixing module results in drastically different algorithmic strategies employed by the architecture to solve the task. Second, we interpret the feed-forward block as a memorization module, which requires appropriate scaling, contingent on the type of token-mixing mechanism, to attain sufficient memorization capacity. Third, by varying the key hyperparameters of the problem, e.g., vocabulary size, model dimension, and the number of hidden neurons in the feed-forward block, we identify multiple learning regimes in which the models exhibit marked differences in algorithmic behavior.
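
To make the model class concrete, the following is a minimal sketch, not the authors' code, of a single block that alternates a token-mixing step (either dot-product attention or a simple learned linear mix over positions) with a feed-forward feature transformation, applied to a toy counting task. All module names, dimensions, and the count-as-classification target encoding are illustrative assumptions; the poster abstract does not specify these details.

# Minimal sketch of the "token mixing + feed-forward" model class on a toy counting task.
# Hypothetical names and hyperparameters throughout; written in PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixThenFeedForward(nn.Module):
    def __init__(self, vocab_size, seq_len, d_model, d_hidden, mixing="attention"):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.mixing = mixing
        if mixing == "attention":
            # dot-product attention as the token-mixing mechanism
            self.attn = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
        else:
            # a simple learned linear mixing across the sequence dimension
            self.mix = nn.Linear(seq_len, seq_len, bias=False)
        # feed-forward block, interpreted here as a memorization module
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, seq_len + 1),  # logits over possible counts 0..seq_len
        )

    def forward(self, tokens):                 # tokens: (batch, seq_len) integer ids
        x = self.embed(tokens)                 # (batch, seq_len, d_model)
        if self.mixing == "attention":
            mixed, _ = self.attn(x, x, x)
        else:
            mixed = self.mix(x.transpose(1, 2)).transpose(1, 2)
        return self.ff(x + mixed)              # per-position logits over counts

# Toy counting target: for each position, how often its token appears in the sequence.
tokens = torch.randint(0, 32, (8, 16))                           # vocab 32, length 16
counts = (tokens.unsqueeze(-1) == tokens.unsqueeze(-2)).sum(-1)  # (8, 16), values 1..16
model = MixThenFeedForward(vocab_size=32, seq_len=16, d_model=64, d_hidden=128)
logits = model(tokens)                                           # (8, 16, 17)
loss = F.cross_entropy(logits.reshape(-1, 17), counts.reshape(-1))

Setting mixing="linear" swaps only the token-mixing module while keeping the feed-forward block fixed, mirroring the comparison between attention and a simple linear mixing model described in the abstract.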