

Virtual presentation / poster accept

Provable Memorization Capacity of Transformers

Junghwan Kim · Michelle Kim · Barzan Mozafari

Keywords: [ contextual mapping ] [ deep learning theory ] [ transformer ] [ Expressiveness ] [ Permutation Equivariance ] [ Memorization ] [ Deep Learning and representational learning ]


Abstract: Quantifying memorization capacity is essential for understanding the expressiveness and generalizability of deep learning model architectures. However, the memorization capacity of the Transformer architecture has yet to be explored. In this work, we present the first study of the memorization capacity of the Transformer architecture. We prove that Transformers are capable of memorizing $N$ sequence-to-sequence mappings of length $n$ with $d$-dimensional input tokens using $\tilde{O}(d + n + \sqrt{nN})$ parameters. Our theory supports memorization both with and without permutation equivariance, utilizing positional encodings in the latter case. Building on our theory, we also analyze the memorization capacity of Transformers in the sequence classification and language modeling tasks. To verify these theoretical findings, we conduct experiments analyzing the memorization capacity of Transformers in the natural language domain.
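
To give a rough sense of the stated bound, a hypothetical instantiation (the concrete values below are illustrative assumptions, not taken from the paper, and constants and logarithmic factors hidden by the $\tilde{O}$ notation are suppressed) with $N = 10^{4}$ sequences, length $n = 128$, and token dimension $d = 768$ works out as follows:

% Hypothetical instantiation of the bound \tilde{O}(d + n + \sqrt{nN});
% the numbers are illustrative assumptions, not results from the paper.
\[
  d + n + \sqrt{nN}
  \;=\; 768 + 128 + \sqrt{128 \cdot 10^{4}}
  \;\approx\; 768 + 128 + 1132
  \;\approx\; 2 \times 10^{3}.
\]

For large $N$ the $\sqrt{nN}$ term dominates, so under this bound the parameter count needed for memorization grows only sublinearly in the number of sequences $N$.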
