Poster in Workshop: Bridging the Gap Between Practice and Theory in Deep Learning
Training Dynamics of Multi-Head Softmax Attention: Emergence, Convergence, and Optimality
Siyu Chen · Heejune Sheen · Zhuoran Yang · Tianhao Wang
Abstract:
We study the dynamics of gradient flow for training a multi-head softmax attention model for in-context learning of multi-task linear regression, where the key and query weights are trained separately rather than being merged into a single weight matrix. We prove the convergence of gradient flow for suitable choices of initialization. As a byproduct of the convergence analysis, we illustrate an emergence behavior of the attention heads: for each task, an optimal head suddenly dominates the attention output after a warm-up stage. We further characterize the optimality of the solution found by gradient flow and show that it depends critically on certain symmetry properties of the initialization.
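As a rough illustration of the setting described in the abstract, below is a minimal NumPy sketch of one forward pass of a multi-head softmax attention predictor on an in-context linear-regression prompt, with separate per-head key and query matrices. The token embedding, the value parameterization, and all names (`multihead_attention_predict`, `WQ`, `WK`, `WV`) are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def multihead_attention_predict(X, y, x_query, WQ, WK, WV):
    """Sketch of a multi-head softmax attention forward pass on an
    in-context linear-regression prompt (assumed embedding).

    X       : (N, d)  in-context inputs
    y       : (N,)    in-context responses
    x_query : (d,)    query input whose response is predicted
    WQ, WK  : lists of (d+1, d+1) per-head query / key matrices,
              kept separate rather than merged into one matrix
    WV      : list of (d+1,) per-head value read-out vectors
    """
    # Embed each (x_i, y_i) pair as one token; the query token carries y = 0.
    tokens = np.concatenate([X, y[:, None]], axis=1)   # (N, d+1)
    q_tok = np.concatenate([x_query, [0.0]])           # (d+1,)

    pred = 0.0
    for Q, K, v in zip(WQ, WK, WV):
        scores = (tokens @ K.T) @ (Q @ q_tok)          # (N,) attention logits
        attn = softmax(scores)                         # softmax over the context
        pred += attn @ (tokens @ v)                    # this head's contribution
    return pred

# Hypothetical usage: one prompt drawn from a random linear-regression task.
d, N, H = 4, 16, 2
rng = np.random.default_rng(0)
w_task = rng.normal(size=d)
X = rng.normal(size=(N, d))
y = X @ w_task
x_query = rng.normal(size=d)
WQ = [rng.normal(size=(d + 1, d + 1)) * 0.1 for _ in range(H)]
WK = [rng.normal(size=(d + 1, d + 1)) * 0.1 for _ in range(H)]
WV = [rng.normal(size=d + 1) * 0.1 for _ in range(H)]
print(multihead_attention_predict(X, y, x_query, WQ, WK, WV))
```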