Poster in Workshop: Bridging the Gap Between Practice and Theory in Deep Learning

Training Dynamics of Multi-Head Softmax Attention: Emergence, Convergence, and Optimality

Siyu Chen · Heejune Sheen · Zhuoran Yang · Tianhao Wang


Abstract:

We study the dynamics of gradient flow for training a multi-head softmax attention model for in-context learning of multi-task linear regression, where the key and query weights are trained separately instead of being combined into a single weight matrix. We prove the convergence of gradient flow for suitable choices of initialization. As a byproduct of the convergence analysis, we illustrate the emergence behavior of the attention heads: for each task, an optimal head suddenly comes to dominate the attention output after a warm-up stage. We further characterize the optimality of the solution found by gradient flow and show that it depends heavily on certain symmetry properties of the initialization.
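The setting can be illustrated with a minimal sketch of a multi-head softmax attention forward pass on an in-context linear-regression prompt, with the key and query weights kept as separate matrices rather than merged into a single product. All shapes, the value parameterization, the absence of an output projection, and the readout of the prediction from the query token are illustrative assumptions, not the paper's exact model.

```python
# Minimal sketch (assumptions, not the paper's exact parameterization):
# multi-head softmax attention applied to an in-context linear-regression
# prompt, with separate trainable key and query matrices per head.

import numpy as np

rng = np.random.default_rng(0)

d, n, H = 4, 16, 2           # input dim, context length, number of heads
D = d + 1                    # token dim: each token stacks (x_i, y_i)

# In-context prompt: n labeled examples plus one query with its label zeroed out.
w_star = rng.normal(size=d)                  # task vector for this prompt
X = rng.normal(size=(n + 1, d))
y = X @ w_star
tokens = np.concatenate([X, y[:, None]], axis=1)
tokens[-1, -1] = 0.0                         # hide the query label

# Separate key/query (and value) weights per head; these are the trained parameters.
W_K = rng.normal(size=(H, D, D)) * 0.1
W_Q = rng.normal(size=(H, D, D)) * 0.1
W_V = rng.normal(size=(H, D, D)) * 0.1

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Z):
    """Return the model's prediction for the query label (last token, last coordinate)."""
    out = np.zeros_like(Z)
    for h in range(H):
        K, Q, V = Z @ W_K[h].T, Z @ W_Q[h].T, Z @ W_V[h].T
        A = softmax(Q @ K.T / np.sqrt(D), axis=-1)   # softmax attention scores
        out = out + A @ V                            # sum over heads (no output projection here)
    return out[-1, -1]                               # read the prediction off the query token

print("prediction:", multi_head_attention(tokens), "target:", y[-1])
```

In the analyzed regime, gradient flow on a squared loss over such prompts drives, for each task, one head's attention pattern to dominate the output after a warm-up stage, which is the emergence behavior described in the abstract.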