Poster
Scaling Stick-Breaking Attention: An Efficient Implementation and In-depth Study
Shawn Tan · Songlin Yang · Aaron Courville · Rameswar Panda · Yikang Shen
Hall 3 + Hall 2B #131
Abstract:
The self-attention mechanism traditionally relies on the softmax operator, necessitating positional embeddings like RoPE or position biases to account for token order. However, current methods using these approaches still face length generalisation challenges. We investigate an alternative attention mechanism based on the stick-breaking process in larger-scale settings. The method works as follows: for each token preceding the current token, we determine a break point, which represents the proportion of the remaining stick, i.e. the attention weight, to allocate to that token. We repeat this on the remaining stick until all tokens are allocated a weight, resulting in a sequence of attention weights. This process naturally incorporates recency bias, which has linguistic motivations for grammar parsing (Shen et al., 2017). We study the implications of replacing the conventional softmax-based attention mechanism with stick-breaking attention. We then discuss an implementation of numerically stable stick-breaking attention and adapt Flash Attention to accommodate this mechanism. When used as a drop-in replacement for current softmax+RoPE attention systems, we find that stick-breaking attention performs competitively with current methods on length generalisation and downstream tasks. Stick-breaking also performs well at length generalisation, allowing a model trained with a shorter context window to perform well at longer context lengths, with perplexity improvements.
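To make the stick-breaking allocation concrete, here is a minimal sketch (not the paper's kernel) of the attention weights for a single query position, assuming sigmoid break points computed from query-key logits; the function name and the naive loop are illustrative only.

```python
import numpy as np

def stick_breaking_weights(logits: np.ndarray) -> np.ndarray:
    """Allocate attention mass over past tokens, most recent first.

    logits: shape (t,), scores for keys 0..t-1 (key t-1 is the most recent).
    Returns weights of shape (t,) summing to at most 1; any leftover mass
    corresponds to the part of the stick never broken off.
    """
    beta = 1.0 / (1.0 + np.exp(-logits))      # break point per key (sigmoid)
    weights = np.zeros_like(beta)
    remaining = 1.0                           # length of the remaining stick
    for j in reversed(range(len(beta))):      # walk from most recent to oldest
        weights[j] = beta[j] * remaining      # take a fraction of what is left
        remaining *= (1.0 - beta[j])          # shrink the stick
    return weights

# Example: with equal logits, recent tokens still receive more mass,
# illustrating the built-in recency bias.
print(stick_breaking_weights(np.array([0.0, 0.0, 0.0, 0.0])))
# -> [0.0625 0.125  0.25   0.5  ]
```

In practice, as the abstract notes, the weights are computed in a numerically stable form (e.g. working with log-sigmoid terms rather than explicit products) inside a Flash-Attention-style kernel; the explicit loop above is only for exposition.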