Skip to yearly menu bar Skip to main content


Poster

Transformers are Universal In-context Learners

Takashi Furuya · Maarten V de Hoop · Gabriel Peyré

Hall 3 + Hall 2B #430
[ ]
Fri 25 Apr 7 p.m. PDT — 9:30 p.m. PDT

Abstract:

Transformers are deep architectures that define in-context mappings'' which enable predicting new tokens based on a given set of tokens (such as a prompt in NLP applications or a set of patches for a vision transformer). In this work, we study in particular the ability of these architectures to handle an arbitrarily large number of context tokens. To mathematically, uniformly address their expressivity, we consider the case that the mappings are conditioned on a context represented by a probability distribution of tokens which becomes discrete for a finite number of these. The relevant notion of smoothness then corresponds to continuity in terms of the Wasserstein distance between these contexts. We demonstrate that deep transformers are universal and can approximate continuous in-context mappings to arbitrary precision, uniformly over compact token domains. This result implies, as a special case, that transformers are universal approximators for continuous permutation-invariant mappings over a fixed number of tokens. It also establishes the universal approximation capability of transformers for certain in-context learning tasks, demonstrating in particular their ability to perform regression within context. A key aspect of our results, compared to existing findings, is that for a fixed precision, a single transformer can operate on an arbitrary (even infinite) number of tokens. Additionally, it operates with a fixed embedding dimension of tokens (this dimension does not increase with precision) and a fixed number of heads (proportional to the dimension). The use of MLPs between multi-head attention layers is also explicitly controlled. We consider both unmasked attentions (as used for the vision transformer) and masked causal attentions (as used for NLP and time series applications). We tackle the causal setting leveraging a space-time lifting to analyze causal attention as a mapping over probability distributions of tokens.

Live content is unavailable. Log in and register to view live content