

Spotlight Poster

How to Capture Higher-order Correlations? Generalizing Matrix Softmax Attention to Kronecker Computation

Josh Alman · Zhao Song

Halle B #247

Abstract: In the classical transformer attention scheme, we are given three $n \times d$ size matrices $Q, K, V$ (the query, key, and value tokens), and the goal is to compute a new $n \times d$ size matrix $D^{-1} \exp(QK^\top) V$ where $D = \mathrm{diag}( \exp(QK^\top) {\bf 1}_n )$. Here, $\exp()$ is applied entry-wise and ${\bf 1}_n$ denotes a length-$n$ vector whose entries are all ones.

Intuitively, attention computation captures pairwise information between words in a sentence, but not higher-order information. Indeed, recent work \cite{sht23} has shown that attention units cannot solve simple problems about detecting triples of connected words.

In this work, we study a generalization of attention which captures triple-wise correlations. The generalization is based on computations involving tensors defined by tuples of words. More formally, given five $n \times d$ size matrices $Q, K_1, K_2, V_1$ and $V_2$ (generalized query, key, and value tokens), our new goal is to compute an $n \times d$ size matrix $D^{-1} \exp( Q ( K_1 \oslash K_2)^\top ) (V_1 \oslash V_2)$ where $D = \mathrm{diag}( \exp( Q ( K_1 \oslash K_2)^\top ) {\bf 1}_{n^2} )$ and $K_1 \oslash K_2 \in \mathbb{R}^{n^2 \times d}$ denotes the column-wise Kronecker product of $K_1$ and $K_2$. This generalization is indeed able to solve problems about detecting triple-wise connections that were shown to be impossible for transformers.

The potential downside of this generalization is that the computations appear to be even more difficult, since the straightforward algorithm requires cubic time in $n$. However, we show that in the bounded-entry setting (which arises in practice and is well-studied in both theory and practice), there is actually a near-linear time algorithm. More precisely, we show that bounded entries are both necessary and sufficient for quickly performing generalized computations:

$\bullet$ On the positive side, if all entries of the input matrices are bounded above by $o(\sqrt[3]{\log n})$, then we show how to approximate the ``tensor-type'' attention matrix in $n^{1+o(1)}$ time.

$\bullet$ On the negative side, we show that if the entries of the input matrices may be as large as $\Omega(\sqrt[3]{\log n})$, then there is no algorithm that runs faster than $n^{3-o(1)}$ (assuming the Strong Exponential Time Hypothesis from fine-grained complexity theory).

We also show that our construction, algorithms, and lower bounds naturally generalize to higher-order tensors and correlations. Interestingly, the higher the order of the tensors, the lower the bound on the entries needs to be for an efficient algorithm. Our results thus yield a natural tradeoff between the boundedness of the entries and the order of the tensor one may use for more expressive, efficient attention computation.

Our constructions make use of a novel connection with a higher-order variant of the kernel density estimation problem. They combine a number of technical tools, including the polynomial method, algebraic geometry codes, and multiparty Merlin-Arthur communication protocols.
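To make the definitions above concrete, the following is a minimal NumPy sketch of the straightforward (cubic-time) evaluation of the generalized ``tensor-type'' attention, using the column-wise Kronecker product $\oslash$ as defined in the abstract. The helper names (`col_kron`, `tensor_attention_naive`) are illustrative rather than from the paper, and this is the brute-force baseline, not the near-linear-time approximation algorithm the paper develops for the bounded-entry setting.

```python
import numpy as np

def col_kron(A, B):
    """Column-wise Kronecker (Khatri-Rao) product of two n x d matrices.
    Column c of the result is kron(A[:, c], B[:, c]), so the output is n^2 x d:
    row (i*n + j) is the entrywise product A[i, :] * B[j, :]."""
    n, d = A.shape
    # (n, 1, d) * (1, n, d) -> (n, n, d), then merge the first two axes.
    return (A[:, None, :] * B[None, :, :]).reshape(n * n, d)

def tensor_attention_naive(Q, K1, K2, V1, V2):
    """Brute-force evaluation of D^{-1} exp(Q (K1 ⊘ K2)^T) (V1 ⊘ V2).
    Materializing the n x n^2 attention matrix costs O(n^3 d) time."""
    K = col_kron(K1, K2)                          # n^2 x d
    V = col_kron(V1, V2)                          # n^2 x d
    A = np.exp(Q @ K.T)                           # n x n^2, exp applied entry-wise
    D_inv = 1.0 / A.sum(axis=1, keepdims=True)    # diagonal of D^{-1}, as a column
    return D_inv * (A @ V)                        # n x d output

# Tiny usage example on random inputs.
rng = np.random.default_rng(0)
n, d = 4, 3
Q, K1, K2, V1, V2 = (rng.standard_normal((n, d)) for _ in range(5))
out = tensor_attention_naive(Q, K1, K2, V1, V2)
print(out.shape)  # (4, 3)
```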
