Skip to yearly menu bar Skip to main content


Poster

Distributional Associations vs In-Context Reasoning: A Study of Feed-forward and Attention Layers

Lei Chen · Joan Bruna · Alberto Bietti

Hall 3 + Hall 2B #141
[ ]
Fri 25 Apr 7 p.m. PDT — 9:30 p.m. PDT

Abstract:

Large language models have been successful at tasks involving basic forms of in-context reasoning, such as generating coherent language, as well as storing vast amounts of knowledge. At the core of the Transformer architecture behind such models are feed-forward and attention layers, which are often associated to knowledge and reasoning, respectively. In this paper, we study this distinction empirically and theoretically in a controlled synthetic setting where certain next-token predictions involve both distributional and in-context information. We find that feed-forward layers tend to learn simple distributional associations such as bigrams, while attention layers focus on in-context reasoning. Our theoretical analysis identifies the noise in the gradients as a key factor behind this discrepancy. Finally, we illustrate how similar disparities emerge in pre-trained models through ablations on the Pythia model family on simple reasoning tasks.

Live content is unavailable. Log in and register to view live content