The Softmax Bottleneck Does Not Limit the Probabilities of the Most Likely Tokens
Ronen Basri ⋅ David Jacobs
Abstract
In many popular transformer architectures, an output projection matrix linearly maps lower-dimensional embeddings into the higher-dimensional space of logits. It has been shown that this creates a softmax bottleneck that prevents the production of arbitrary probability distributions, and it has been argued that this limits the ability of large language models (LLMs) to express next-token probabilities that perfectly align with the statistics of natural language. We focus instead on the ability of such models to produce accurate probabilities for just the top-$m$ tokens. We provide theoretical bounds showing that even a randomly initialized projection matrix can do this for rather large values of $m$, and we support these bounds with empirical results on both random and trained matrices. This raises questions about whether the softmax bottleneck significantly limits the capabilities of LLMs. We also derive bounds on the maximum number of probabilities that any trained output projection matrix can specify.
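To make the setting concrete, the following is a minimal numerical sketch of the phenomenon the abstract describes: a randomly initialized projection matrix $W \in \mathbb{R}^{V \times d}$ with $d \ll V$ cannot realize arbitrary distributions over $V$ tokens, yet an embedding can be fit so that the top-$m$ probabilities come out close to a prescribed target. The sketch fits the embedding by convex cross-entropy minimization against a completed target distribution (the tail filled uniformly) and reports how closely the top-$m$ probabilities are matched. All sizes, the Zipf-shaped target, and the fitting procedure are illustrative assumptions, not the construction or the bounds from the paper itself.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, m = 2000, 128, 20          # vocab size, embedding dim, top-m (hypothetical toy sizes)

# Randomly initialized output projection, rows roughly unit norm.
W = rng.standard_normal((V, d)) / np.sqrt(d)

# Hypothetical target: Zipf-shaped probabilities on the top-m tokens carrying
# 70% of the mass, with the remaining 30% spread uniformly over the tail.
zipf = 1.0 / np.arange(1, m + 1)
p_full = np.full(V, 0.3 / (V - m))
p_full[:m] = 0.7 * zipf / zipf.sum()

def softmax(z):
    e = np.exp(z - z.max())    # shift for numerical stability
    return e / e.sum()

# Fit an embedding h by full-batch gradient descent on the cross-entropy
# -p_full^T log softmax(W h), which is convex in h.  This is a simple
# numerical probe, not the construction used in the paper's proofs.
h = np.zeros(d)
lr = 1.0 / np.linalg.norm(W, 2) ** 2   # step size from the smoothness bound ||W||_2^2
for _ in range(5000):
    q = softmax(W @ h)
    h -= lr * (W.T @ (q - p_full))     # gradient of the convex cross-entropy objective

q = softmax(W @ h)
print(f"max abs error on top-{m} probs: {np.abs(q[:m] - p_full[:m]).max():.2e}")
print(f"target top-{m} mass: {p_full[:m].sum():.3f}, achieved: {q[:m].sum():.3f}")
```

Because $d < V$, the full $V$-dimensional distribution is generally out of reach (the bottleneck), so the fit only matches $d$ linear moments of the target; the printed error measures how well the $m$ coordinates of interest are nevertheless reproduced at these toy sizes.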