Minor First, Major Last: A Depth-Induced Implicit Bias of Sharpness-Aware Minimization
Chaewon Moon · Dongkuk Si · Chulhee Yun
Abstract
We study the implicit bias of sharpness-aware minimization (SAM) when training $L$-layer diagonal linear networks on linearly separable binary classification data. For linear models ($L=1$), both $\ell_\infty$- and $\ell_2$-SAM recover the $\ell_2$ max-margin classifier, matching gradient descent (GD). For depth $L = 2$, however, the behavior changes drastically, even on a single-example dataset where the dynamics can be analyzed. For $\ell_\infty$-SAM, the limit direction depends critically on initialization: the predictor can converge to $0$ or align with any standard basis vector, in stark contrast to GD, whose limit aligns with the basis vector of the dominant coordinate in the data. For $\ell_2$-SAM, we uncover a phenomenon we call *sequential feature discovery*, in which the predictor initially relies on minor coordinates and gradually shifts to major ones as training proceeds or the initialization scale grows. Our theoretical analysis attributes this phenomenon to the gradient normalization factor in $\ell_2$-SAM's perturbation, which amplifies minor coordinates early in training and allows major ones to dominate later. Synthetic and real-data experiments corroborate our findings.
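To make the two perturbation rules concrete, below is a minimal sketch (not the paper's experimental setup) of one SAM step on a depth-2 diagonal network $f(x) = \langle u \odot v, x \rangle$ trained on a single example. The logistic loss, the specific parameterization, and the hyperparameters `rho` and `eta` are illustrative assumptions; only the perturbation rules, $\rho\,\nabla L/\|\nabla L\|_2$ for $\ell_2$-SAM and $\rho\,\mathrm{sign}(\nabla L)$ for $\ell_\infty$-SAM, follow the standard SAM formulations referenced in the abstract.

```python
import numpy as np

# Illustrative sketch: one SAM step on a depth-2 diagonal linear network
# f(x) = <u * v, x>, logistic loss, single example (x, y = +1).
# rho (perturbation radius), eta (step size), and the loss are assumptions.

def loss_grad(u, v, x, y=1.0):
    """Logistic loss and its gradient w.r.t. (u, v) for one example."""
    margin = y * np.dot(u * v, x)
    loss = np.log1p(np.exp(-margin))
    g = -y / (1.0 + np.exp(margin))   # d loss / d f
    grad_u = g * v * x                # chain rule: d f / d u = v * x
    grad_v = g * u * x
    return loss, grad_u, grad_v

def sam_step(u, v, x, rho=0.05, eta=0.1, norm="l2"):
    """Perturb along the (normalized or signed) gradient, then descend."""
    _, gu, gv = loss_grad(u, v, x)
    if norm == "l2":
        # l2-SAM: normalize the full gradient before perturbing
        gnorm = np.sqrt(np.sum(gu**2) + np.sum(gv**2)) + 1e-12
        eu, ev = rho * gu / gnorm, rho * gv / gnorm
    else:
        # l_inf-SAM: perturb by the sign of each gradient coordinate
        eu, ev = rho * np.sign(gu), rho * np.sign(gv)
    # descent step uses the gradient at the perturbed point
    _, gu_p, gv_p = loss_grad(u + eu, v + ev, x)
    return u - eta * gu_p, v - eta * gv_p

# Toy usage: coordinate 0 is the "major" feature of the single example.
x = np.array([3.0, 1.0, 0.5])
u = np.full(3, 0.1)
v = u.copy()
for _ in range(200):
    u, v = sam_step(u, v, x, norm="l2")
print(u * v)   # effective linear predictor after training
```

Tracking `u * v` over iterations (or rerunning with a larger initialization scale) is one way to visualize, in this toy setting, how the relative weight on minor versus major coordinates evolves.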