ICLR 2018


Workshop

Finding Flatter Minima with SGD

Devansh Arpit

East Meeting Level 8 + 15 #12

It has been observed that over-parameterized deep neural networks (DNNs) trained using stochastic gradient descent (SGD) with smaller batch sizes generalize better than those trained with larger batch sizes. Additionally, model parameters found by small-batch SGD tend to lie in flatter regions of the loss surface. We extend these empirical observations and experimentally show that both a large learning rate and a small batch size contribute towards SGD finding flatter minima that generalize well. Conversely, we find that small learning rates and large batch sizes lead to sharper minima that correlate with poor generalization in DNNs.
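The sketch below is not the authors' code; it is a minimal, illustrative version of the kind of comparison the abstract describes: train the same small network under two SGD regimes (small batch with a large learning rate vs. large batch with a small learning rate) and compare the sharpness of the resulting minima, here estimated as the top Hessian eigenvalue via power iteration on Hessian-vector products. The model, synthetic data, hyperparameters, and sharpness proxy are all assumptions for illustration.

```python
# Illustrative sketch (not the authors' method): compare sharpness of minima
# reached by SGD under two (batch size, learning rate) regimes.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(2048, 20)
y = (X[:, 0] * X[:, 1] > 0).long()          # synthetic binary labels

def make_model():
    torch.manual_seed(1)                     # identical initialization for both runs
    return nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))

def train(batch_size, lr, steps=2000):
    model, loss_fn = make_model(), nn.CrossEntropyLoss()
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        idx = torch.randint(0, X.shape[0], (batch_size,))
        opt.zero_grad()
        loss_fn(model(X[idx]), y[idx]).backward()
        opt.step()
    return model

def top_hessian_eigenvalue(model, iters=50):
    """Power iteration with Hessian-vector products on the full-data loss."""
    loss = nn.CrossEntropyLoss()(model(X), y)
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    for _ in range(iters):
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        norm = torch.sqrt(sum((h * h).sum() for h in hv))
        v = [h / norm for h in hv]
    hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
    return sum((h * u).sum() for h, u in zip(hv, v)).item()   # Rayleigh quotient

for name, (bs, lr) in {"small batch / large lr": (16, 0.1),
                       "large batch / small lr": (1024, 0.01)}.items():
    model = train(bs, lr)
    print(f"{name}: top Hessian eigenvalue ~ {top_hessian_eigenvalue(model):.2f}")
```

Under the trend reported in the abstract, the small-batch / large-learning-rate run would be expected to end in a flatter region (smaller top Hessian eigenvalue) than the large-batch / small-learning-rate run; the toy setup here only illustrates the measurement, not the paper's results.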
