Poster
kNN Attention Demystified: A Theoretical Exploration for Scalable Transformers
Themistoklis Haris
Hall 3 + Hall 2B #131
[
Abstract
]
Thu 24 Apr midnight PDT
— 2:30 a.m. PDT
Abstract:
Despite their power, Transformers face challenges with long sequences due to the quadratic complexity of self-attention. To address this limitation, methods like -Nearest-Neighbor (NN) attention have been introduced [Roy et al., 2017], enabling each token to attend to only its closest tokens. While NN attention has shown empirical success in making Transformers more efficient, its exact approximation guarantees have not been theoretically analyzed. In this work, we establish a theoretical framework for NN attention, reformulating self-attention as expectations over softmax distributions and leveraging lazy Gumbel sampling [Mussmann et al., 2017] with NN indices for efficient approximation. Building on this framework, we also propose novel sub-quadratic algorithms that approximate self-attention gradients by leveraging efficient sampling techniques, such as Markov Chain-based estimation. Finally, we demonstrate the practical effectiveness of these algorithms through empirical experiments, showcasing their benefits in both training and inference.
Live content is unavailable. Log in and register to view live content