Poster in Workshop: Bridging the Gap Between Practice and Theory in Deep Learning
Optimization Effectiveness versus Generalization Capability of Stochastic Optimization Algorithms for Deep Learning
Toki Tahmid Inan · Mingrui Liu · Amarda Shehu
The rich deep learning optimization literature reflects our fragmented understanding of what makes a good optimizer and, more importantly, of whether improved optimization performance confers better generalization. The current literature also neglects an innate characteristic of SGD and its variants, namely their stochasticity, and so fails to benchmark these algorithms in a way that reveals their performance in a statistical sense. We fill this gap in this paper. Unlike existing work, which evaluates the end point of a single optimization trajectory, we sample from an ensemble of several optimization trajectories, allowing us to estimate the stationary distribution of a stochastic optimizer. We cast a wide net and include SGD and its noise-enabled variants, flat-minima optimizers, as well as new algorithms that we debut in this paper by recasting noise-enabled optimizers under the Basin Hopping (BH) framework. Our evaluation covers both synthetic functions with known global and local minima of varying flatness and real-world problems in computer vision and natural language processing. Our benchmarking accounts for the statistical setting, comparing populations of models and testing for statistical significance. Our paper reports several findings on the relationship between training loss and hold-out accuracy and on the comparable performance of SGD, its noise-enabled variants, and the novel BH-based optimizers; indeed, these algorithms match the performance of flat-minima optimizers such as SAM with half the gradient evaluations. We hope that this work will support further research that accounts for the stochasticity of optimizers for deep learning.
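To make the Basin Hopping recasting concrete, below is a minimal, illustrative sketch (not the authors' implementation) of how a noise-enabled optimizer such as plain SGD can be wrapped in a BH outer loop: an inner local descent, a random perturbation of the weights, and a Metropolis-style accept/reject step on the training loss. The names `bh_sgd`, `local_descent`, `perturb_scale`, `n_hops`, and `temperature` are assumptions introduced for illustration.

```python
# Illustrative sketch: a noise-enabled optimizer (plain SGD here) recast
# under the Basin Hopping framework. Names and hyperparameters are
# placeholders, not the paper's API.
import copy
import math
import random

import torch


def local_descent(model, loss_fn, data_loader, lr=0.1, steps=100):
    """Inner loop: ordinary SGD steps starting from the current weights."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    it = iter(data_loader)
    loss = None
    for _ in range(steps):
        try:
            x, y = next(it)
        except StopIteration:
            it = iter(data_loader)
            x, y = next(it)
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()


def bh_sgd(model, loss_fn, data_loader, n_hops=20, perturb_scale=0.05, temperature=1.0):
    """Outer loop: perturb the current minimizer, re-run local descent,
    and accept or reject the new basin with a Metropolis criterion."""
    best_model = copy.deepcopy(model)
    best_loss = local_descent(best_model, loss_fn, data_loader)
    current_model, current_loss = copy.deepcopy(best_model), best_loss

    for _ in range(n_hops):
        candidate = copy.deepcopy(current_model)
        with torch.no_grad():
            for p in candidate.parameters():
                p.add_(perturb_scale * torch.randn_like(p))  # basin-hopping jump
        cand_loss = local_descent(candidate, loss_fn, data_loader)

        # Always accept improvements; occasionally accept worse basins.
        if cand_loss < current_loss or random.random() < math.exp((current_loss - cand_loss) / temperature):
            current_model, current_loss = candidate, cand_loss
        if cand_loss < best_loss:
            best_model, best_loss = copy.deepcopy(candidate), cand_loss

    return best_model, best_loss
```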
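The statistical benchmarking protocol can likewise be sketched as follows: each optimizer is treated as a distribution over trained models by repeating training from independent seeds, and the resulting populations of hold-out accuracies are compared with a non-parametric significance test. `train_and_evaluate`, `n_runs`, and the optimizer names are hypothetical placeholders for the reader's own training pipeline, not the paper's code.

```python
# Illustrative sketch of population-level benchmarking of stochastic
# optimizers: repeat training across seeds, then run pairwise
# Mann-Whitney U tests on the hold-out accuracies.
import numpy as np
from scipy.stats import mannwhitneyu


def benchmark(optimizers, train_and_evaluate, n_runs=30, seed0=0):
    """Collect a population of hold-out accuracies per optimizer."""
    populations = {}
    for name in optimizers:
        accs = [train_and_evaluate(optimizer=name, seed=seed0 + r) for r in range(n_runs)]
        populations[name] = np.asarray(accs)
    return populations


def compare(populations, alpha=0.05):
    """Pairwise two-sided Mann-Whitney U tests between optimizer populations."""
    names = list(populations)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            stat, p = mannwhitneyu(populations[a], populations[b], alternative="two-sided")
            verdict = "significant" if p < alpha else "not significant"
            print(f"{a} vs {b}: U={stat:.1f}, p={p:.4f} ({verdict} at alpha={alpha})")
```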