ICLR Poster Mechanism and emergence of stacked attention heads in multi-layer transformers

Tiberiu Mușat

Hall 3 + Hall 2B #536

Fri 25 Apr midnight PDT — 2:30 a.m. PDT

Abstract:

In this paper, I introduce the retrieval problem, a simple yet common reasoning task that can be solved only by transformers with a minimum number of layers, which grows logarithmically with the input size. I empirically show that large language models can solve the task under different prompting formulations without any fine-tuning. To understand how transformers solve the retrieval problem, I train several transformers on a minimal formulation. Successful learning occurs only in the presence of an implicit curriculum. I uncover the learned mechanisms by studying the attention maps in the trained transformers. I also study the training process and find that attention heads always emerge in a specific sequence guided by the implicit curriculum.