Poster in Workshop: Bridging the Gap Between Practice and Theory in Deep Learning
On the Relationship Between Small Initialization and Flatness in Deep Networks
Soo Min Kwon · Lijun Ding · Laura Balzano · Qing Qu
In this work, we investigate the relationship between small initialization and flatness in learning deep networks. We empirically observe that, while all initialization scales lead to minima with zero training error, the minima reached by stochastic gradient descent from smaller initialization scales often generalize better. Our empirical results suggest that these solutions are flatter, as measured by the trace of the Hessian, hinting that small initialization induces an implicit bias towards flat minima, which in turn supports improved generalization. We validate this claim experimentally on the simplest class of overparameterized models: deep linear networks for low-rank matrix recovery tasks. Here, we also discuss the role of depth and demonstrate how finding such flat solutions can be beneficial for learning in the presence of noise. Lastly, we conduct experiments on deep nonlinear networks for classification tasks, showing that the phenomenon similarly holds.
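To make the setup concrete, below is a minimal sketch (not the authors' code) of the kind of experiment the abstract describes: a deep linear network trained by SGD from a small initialization on a low-rank matrix recovery task, with flatness at the found solution gauged by a Hutchinson estimate of the trace of the loss Hessian. The depth, matrix sizes, initialization scale `alpha`, learning rate, and step count are illustrative assumptions, not values from the paper.

```python
# Sketch: deep linear network + small initialization for low-rank matrix recovery,
# with trace-of-Hessian flatness estimate. All hyperparameters are illustrative.
import torch

torch.manual_seed(0)
n, r, depth, alpha, steps, lr = 20, 2, 3, 1e-3, 5000, 0.1

# Ground-truth low-rank matrix M = U V^T with rank r.
U, V = torch.randn(n, r), torch.randn(n, r)
M = U @ V.T

# Deep linear network: the end-to-end map is W_depth ... W_1,
# with each factor initialized at a small scale alpha.
Ws = [torch.nn.Parameter(alpha * torch.randn(n, n)) for _ in range(depth)]

def end_to_end(Ws):
    prod = Ws[0]
    for W in Ws[1:]:
        prod = W @ prod
    return prod

def loss_fn(Ws):
    # Squared-error loss between the end-to-end product and the target matrix.
    return 0.5 * torch.norm(end_to_end(Ws) - M) ** 2

opt = torch.optim.SGD(Ws, lr=lr)
for _ in range(steps):
    opt.zero_grad()
    loss_fn(Ws).backward()
    opt.step()

def hessian_trace(Ws, probes=50):
    # Hutchinson estimator: tr(H) ~= E_v[ v^T H v ] with Rademacher probes v,
    # using Hessian-vector products from double backprop.
    est = 0.0
    for _ in range(probes):
        loss = loss_fn(Ws)
        grads = torch.autograd.grad(loss, Ws, create_graph=True)
        vs = [torch.randint_like(W, 2) * 2.0 - 1.0 for W in Ws]  # +/-1 probes
        gv = sum((g * v).sum() for g, v in zip(grads, vs))
        hvs = torch.autograd.grad(gv, Ws)
        est += sum((h * v).sum() for h, v in zip(hvs, vs)).item()
    return est / probes

print(f"train loss: {loss_fn(Ws).item():.2e}, est. tr(Hessian): {hessian_trace(Ws):.3f}")
```

Repeating this sweep over several values of `alpha` and comparing the resulting trace estimates (and recovery error) is the kind of comparison the abstract's claim refers to; the sketch shows one such run.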