Poster in Workshop: Bridging the Gap Between Practice and Theory in Deep Learning

Implicit Regularization of Gradient Flow for One-layer Softmax Attention

Heejune Sheen · Siyu Chen · Tianhao Wang · Huibin Zhou


Abstract:

We study gradient flow (GF) on the exponential loss for a classification problem with one-layer softmax attention, where the key and query weight matrices are trained separately. Under a separability assumption on the data, we show that GF implicitly minimizes the nuclear norm of the product of the key and query weight matrices, which can be characterized through a support vector machine problem with respect to the attention weights. This finding contrasts with prior results obtained when the key and query matrices are combined into a single weight matrix, where gradient descent implicitly minimizes the Frobenius norm. For diagonal key and query matrices, our analysis builds on reparameterization techniques and the exploitation of approximate KKT conditions. We further extend the results to more general key and query matrices, given proper alignment of their singular spaces at initialization.
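Below is a minimal toy sketch of the training setup the abstract describes: gradient descent (an Euler discretization of gradient flow) on the exponential loss for a simple one-layer softmax-attention classifier with separately trained key and query matrices, while tracking the nuclear norm of their product. The specific model, data, and readout here (first-token query, a fixed value vector v, random Gaussian tokens) are illustrative assumptions, not the construction used in the paper, and the run only illustrates the dynamics rather than verifying the theorem.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n sequences of T tokens in R^d with binary labels (assumed, not from the paper).
n, T, d = 20, 5, 4
X = rng.normal(size=(n, T, d))
y = rng.choice([-1.0, 1.0], size=n)

# Separate key and query matrices (the trained parameters); a fixed value head v for simplicity.
K = 0.1 * rng.normal(size=(d, d))
Q = 0.1 * rng.normal(size=(d, d))
v = rng.normal(size=d)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predict(X, K, Q):
    # Attention scores between the first token (used as the query) and all tokens,
    # followed by a fixed linear readout of the attended representation.
    q = X[:, 0, :] @ Q                      # (n, d)
    k = X @ K                               # (n, T, d)
    scores = np.einsum('ntd,nd->nt', k, q)  # (n, T)
    attn = softmax(scores)                  # (n, T)
    ctx = np.einsum('nt,ntd->nd', attn, X)  # (n, d)
    return ctx @ v                          # (n,)

def exp_loss(X, y, K, Q):
    # Exponential loss on the margins y * f(X).
    return np.mean(np.exp(-y * predict(X, K, Q)))

def num_grad(f, W, eps=1e-5):
    # Central-difference numerical gradient, to keep the sketch dependency-free.
    G = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        E = np.zeros_like(W)
        E[idx] = eps
        G[idx] = (f(W + E) - f(W - E)) / (2 * eps)
    return G

# Euler discretization of gradient flow on the exponential loss,
# updating K and Q separately and monitoring the nuclear norm of K Q^T.
lr = 0.5
for step in range(2001):
    gK = num_grad(lambda W: exp_loss(X, y, W, Q), K)
    gQ = num_grad(lambda W: exp_loss(X, y, K, W), Q)
    K -= lr * gK
    Q -= lr * gQ
    if step % 500 == 0:
        nuc = np.linalg.norm(K @ Q.T, 'nuc')
        print(f"step {step:4d}  loss {exp_loss(X, y, K, Q):.4f}  ||K Q^T||_* {nuc:.3f}")
```

On separable data, the abstract's result suggests that as training proceeds the (suitably normalized) product K Q^T should align with the solution of a nuclear-norm-minimizing SVM over the attention weights; this sketch merely exposes the quantities one would monitor to observe such behavior.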