

Poster in Affinity Workshop: Blog Track Session 4

Masked Language Model with ALiBi and CLAP head

Jason Chou

Halle B #3

Abstract:

As a new approach to positional encoding, Attention with Linear Biases (ALiBi) adds linear biases to the attention weights to encode positional information, with the capability of context length extrapolation. In their paper, however, Press et al. focus on the perplexity of autoregressive decoder-only language models, leaving open the question of downstream tasks and of applicability to encoder attention. In this blogpost, we attempt to bridge the gap by testing masked language models (MLMs) with encoder-attention ALiBi and a prediction head similar to those of the original ALiBi models. We find that while a simplified prediction head may be beneficial, the performance of MLMs with encoder-attention ALiBi starts to deteriorate at a sequence length of 2048 at larger scales. We put our results in the context of related recent experiments and tentatively identify the circumstances that are more challenging for positional encoding designs. Finally, we open-source our MLMs, which achieve BERT-level performance with a 2048 context length.
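As a rough illustration of the idea, the sketch below shows one way an ALiBi-style bias could be built for bidirectional encoder attention: a symmetric distance |i − j| scaled by per-head slopes following the geometric schedule of Press et al. The symmetric-distance choice and the helper names are assumptions for illustration, not necessarily the exact variant used in the blogpost's MLMs.

```python
import torch

def alibi_slopes(num_heads: int) -> torch.Tensor:
    # Geometric sequence of slopes from Press et al.: 2^(-8/n), 2^(-16/n), ...
    # (assumes num_heads is a power of two, as in the original recipe)
    start = 2 ** (-8.0 / num_heads)
    return torch.tensor([start ** (h + 1) for h in range(num_heads)])

def encoder_alibi_bias(seq_len: int, num_heads: int) -> torch.Tensor:
    # Symmetric (bidirectional) distance |i - j|; the decoder-only
    # original instead uses the causal distance -(i - j) for j <= i.
    pos = torch.arange(seq_len)
    dist = (pos[None, :] - pos[:, None]).abs()        # (seq_len, seq_len)
    slopes = alibi_slopes(num_heads)                  # (num_heads,)
    return -slopes[:, None, None] * dist[None, :, :]  # (num_heads, L, L)

# The bias is added to the raw attention scores before the softmax, e.g.:
# scores = q @ k.transpose(-2, -1) / head_dim ** 0.5 + encoder_alibi_bias(L, H)
```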
