Poster in Workshop: Bridging the Gap Between Practice and Theory in Deep Learning
Implicit Regularization of Gradient Flow for One-layer Softmax Attention
Heejune Sheen · Siyu Chen · Tianhao Wang · Huibin Zhou
We study Gradient Flow (GF) on the exponential loss for a classification problem with one-layer softmax attention, where the key and query weight matrices are trained separately. Under a separability assumption on the data, we show that GF implicitly minimizes the nuclear norm of the product of the key and query weight matrices; this implicit bias can be described through a support vector machine (SVM) problem with respect to the attention weights. This finding contrasts with prior results obtained when the key and query matrices are combined into a single weight matrix, in which case gradient descent implicitly minimizes the Frobenius norm. For diagonal key and query matrices, our analysis builds on a reparameterization technique and the exploitation of approximate KKT conditions. We further extend these results to general key and query matrices, provided the singular spaces are properly aligned at initialization.
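The sketch below is an illustrative toy setup, not the authors' exact construction: a one-layer softmax attention classifier with separately parameterized key and query matrices `K` and `Q`, trained by plain gradient descent (a discretization of gradient flow) on the exponential loss, while tracking the nuclear norm of `K @ Q.T`, the quantity the abstract says is implicitly minimized. The data, dimensions, fixed linear head `v`, and use of the first token as the query are hypothetical choices for illustration, and the random data need not satisfy the separability assumption.

```python
import jax
import jax.numpy as jnp

# Hypothetical toy data: n sequences of T tokens in d dimensions, binary labels.
key = jax.random.PRNGKey(0)
n, T, d = 32, 5, 4
kx, ky, kK, kQ, kv = jax.random.split(key, 5)
X = jax.random.normal(kx, (n, T, d))
y = jnp.sign(jax.random.normal(ky, (n,)))
v = jax.random.normal(kv, (d,))          # fixed linear head (not trained)

def predict(params, X):
    K, Q = params
    W = K @ Q.T                           # combined key-query matrix
    # attention scores of every token against the first (query) token
    scores = jnp.einsum('ntd,de,ne->nt', X, W, X[:, 0, :])
    attn = jax.nn.softmax(scores, axis=-1)
    ctx = jnp.einsum('nt,ntd->nd', attn, X)   # attention-weighted context
    return ctx @ v                             # scalar score per sample

def exp_loss(params):
    # exponential loss, as in the abstract's classification setup
    return jnp.mean(jnp.exp(-y * predict(params, X)))

params = (0.1 * jax.random.normal(kK, (d, d)),
          0.1 * jax.random.normal(kQ, (d, d)))
grad_fn = jax.grad(exp_loss)

lr = 0.1
for step in range(2001):
    g = grad_fn(params)
    params = tuple(p - lr * gp for p, gp in zip(params, g))
    if step % 500 == 0:
        K, Q = params
        nuc = jnp.sum(jnp.linalg.svd(K @ Q.T, compute_uv=False))
        print(f"step {step:5d}  loss {exp_loss(params):.4f}  ||KQ^T||_* {nuc:.4f}")
```

Under the paper's assumptions (separable data, and suitably aligned singular spaces at initialization in the general case), the trajectory of `K @ Q.T` is characterized by a nuclear-norm-minimizing SVM over the attention weights; the snippet only makes the training dynamics and the tracked norm concrete.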