

Poster

Towards Understanding the Universality of Transformers for Next-Token Prediction

Michael Sander · Gabriel Peyré

Hall 3 + Hall 2B #379
Sat 26 Apr midnight PDT — 2:30 a.m. PDT

Abstract: Causal Transformers are trained to predict the next token for a given context. While it is widely accepted that self-attention is crucial for encoding the causal structure of sequences, the precise underlying mechanism behind this in-context autoregressive learning ability remains unclear. In this paper, we take a step towards understanding this phenomenon by studying the approximation ability of Transformers for next-token prediction. Specifically, we explore the capacity of causal Transformers to predict the next token $x_{t+1}$ given an autoregressive sequence $(x_1, \dots, x_t)$ as a prompt, where $x_{t+1} = f(x_t)$ and $f$ is a context-dependent function that varies with each sequence. On the theoretical side, we focus on specific instances, namely when $f$ is linear or when $(x_t)$ is periodic. We explicitly construct a Transformer (with linear, exponential, or softmax attention) that learns the mapping $f$ in-context through a causal kernel descent method. The causal kernel descent method we propose provably estimates $x_{t+1}$ based solely on past and current observations $(x_1, \dots, x_t)$, with connections to the Kaczmarz algorithm in Hilbert spaces. We present experimental results that validate our theoretical findings and suggest their applicability to more general mappings $f$.
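
The linear case described in the abstract can be made concrete with a small numerical sketch. The snippet below is not the paper's Transformer construction; it is a minimal NumPy illustration, under assumed choices (an orthogonal map $W$, arbitrary dimension and sequence length, and a hypothetical helper named kaczmarz_next_token), of how a Kaczmarz-style sweep over the observed pairs $(x_s, x_{s+1})$ in the prompt can estimate a context-dependent linear map $f$ and predict the next token from past and current observations only.

import numpy as np

def kaczmarz_next_token(prompt: np.ndarray) -> np.ndarray:
    """Estimate x_{t+1} from a prompt (x_1, ..., x_t) with x_{s+1} = W x_s.

    Illustrative sketch (not the paper's attention construction): a single
    Kaczmarz-style sweep over the pairs (x_s, x_{s+1}); each pair projects
    the running estimate of W onto the affine set {W : W x_s = x_{s+1}}.
    """
    t, d = prompt.shape
    W_hat = np.zeros((d, d))
    for s in range(t - 1):
        x_s, x_next = prompt[s], prompt[s + 1]
        denom = x_s @ x_s
        if denom > 0:
            # Rank-one Kaczmarz update: correct W_hat along the direction x_s.
            W_hat += np.outer(x_next - W_hat @ x_s, x_s) / denom
    return W_hat @ prompt[-1]  # predicted x_{t+1}

# Usage with an assumed context-dependent linear map f(x) = W x, W orthogonal.
rng = np.random.default_rng(0)
d, t = 4, 32
W, _ = np.linalg.qr(rng.standard_normal((d, d)))  # orthogonal W, an assumed choice
xs = [rng.standard_normal(d)]
for _ in range(t - 1):
    xs.append(W @ xs[-1])
prompt = np.stack(xs)
pred = kaczmarz_next_token(prompt)
true_next = W @ prompt[-1]
print(np.linalg.norm(pred - true_next) / np.linalg.norm(true_next))  # relative prediction error

The sweep uses each prompt pair exactly once and in order, so the prediction at position $t$ depends only on $(x_1, \dots, x_t)$, mirroring the causal structure emphasized in the abstract.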
