Sublinear Time Quantum Algorithm for Attention Approximation
Zhao Song · Jianfei Xue · Jiahao Zhang · Lichen Zhang
Abstract
Given query, key, and value matrices $Q, K, V\in \mathbb{R}^{n\times d}$, the attention matrix is defined as $\mathrm{Att}(Q, K, V)=D^{-1}AV$, where $A=\exp(QK^\top/\sqrt{d})$ with $\exp(\cdot)$ applied entrywise and $D=\mathrm{diag}(A{\bf 1}_n)$. The attention matrix is the backbone of modern transformers and large language models, but explicitly forming the softmax matrix $D^{-1}A$ incurs $\Omega(n^2)$ time, motivating numerous approximation schemes that reduce the runtime to $\widetilde O(nd)$ via sparsity or low-rank factorization. We propose a quantum data structure that approximates any row of $\mathrm{Att}(Q, K, V)$ using only row queries to $Q, K, V$. Our algorithm preprocesses these matrices in $\widetilde{O}\left( \epsilon^{-1} n^{0.5} \left( s_\lambda^{2.5} + s_\lambda^{1.5} d + \alpha^{0.5} d \right) \right)$ time, where $\epsilon$ is the target accuracy, $s_\lambda$ is the $\lambda$-statistical dimension of the exponential kernel defined by $Q$ and $K$, and $\alpha$ measures the row distortion of $V$, which is at most $d/\mathrm{srank}(V)$, where $\mathrm{srank}(V)$ denotes the stable rank of $V$. Each row query can then be answered in $\widetilde{O}(s_\lambda^2 + s_\lambda d)$ time. To our knowledge, this is the first quantum data structure that approximates rows of the attention matrix in time sublinear in $n$. Our approach relies on a quantum Nyström approximation of the exponential kernel, quantum multivariate mean estimation for computing $D$, and quantum leverage score sampling for the multiplication with $V$.
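For concreteness, the following is a minimal classical reference implementation of the exact definition above (in Python with NumPy; the paper does not prescribe any implementation). It materializes the full $n \times n$ matrix $A$, and therefore exhibits precisely the $\Omega(n^2)$ cost that the quantum data structure is designed to avoid.

```python
import numpy as np

def attention(Q, K, V):
    """Exact attention: Att(Q, K, V) = D^{-1} A V, where
    A = exp(Q K^T / sqrt(d)) entrywise and D = diag(A 1_n).
    Forming A explicitly takes Omega(n^2) time and memory,
    the bottleneck the sublinear-time algorithm circumvents."""
    n, d = Q.shape
    A = np.exp(Q @ K.T / np.sqrt(d))  # n x n exponential kernel matrix
    D_inv = 1.0 / A.sum(axis=1)       # diagonal of D^{-1}; D = diag(A 1_n)
    return D_inv[:, None] * (A @ V)   # row-normalize, then multiply by V

# Usage on small random inputs:
rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = attention(Q, K, V)  # shape (n, d); each row is a convex combination of rows of V
```

By contrast, the proposed data structure never forms $A$: it answers individual row queries of $\mathrm{Att}(Q, K, V)$ in $\widetilde{O}(s_\lambda^2 + s_\lambda d)$ time after the sublinear-in-$n$ preprocessing stated above.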