ICLR Poster On the Relation Between the Sharpest Directions of DNN Loss and the SGD Step Length

Stanislaw Jastrzebski · Zachary Kenton · Nicolas Ballas · Asja Fischer · Yoshua Bengio · Amos Storkey

Great Hall BC #69

Keywords: [ optimization ] [ sgd ] [ generalization ] [ theory of deep learning ] [ hessian ]

Abstract:

Training deep neural networks with Stochastic Gradient Descent (SGD) using a large learning rate or a small batch size typically ends in flat regions of the weight space, as indicated by small eigenvalues of the Hessian of the training loss. This has been found to correlate with good final generalization performance. In this paper we extend previous work by investigating the curvature of the loss surface along the whole training trajectory, rather than only at the endpoint. We find that SGD initially visits increasingly sharp regions, reaching a maximum sharpness determined by both the learning rate and the batch size. At this peak value, SGD starts to fail to minimize the loss along the directions in the loss surface corresponding to the largest curvature (the sharpest directions). To further investigate the effect of these dynamics on the training process, we study a variant of SGD that uses a reduced learning rate along the sharpest directions, which we show can improve training speed while finding solutions that are both sharper and better generalizing than those found by vanilla SGD. Overall, our results show that the SGD dynamics in the subspace of the sharpest directions influence the regions that SGD steers to (where a larger learning rate or a smaller batch size results in wider regions being visited), the overall training speed, and the generalization ability of the final model.
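
The idea of a reduced learning rate along the sharpest directions can be illustrated on a toy problem. The sketch below is not the authors' implementation: it assumes a fixed quadratic loss with a known Hessian, full-batch gradients instead of mini-batches, and hypothetical names (`top_eigvecs`, `scaled_step`, `gamma`) chosen only for this example. It relies on the standard fact that gradient descent on a quadratic is unstable along a direction with curvature lambda once the learning rate exceeds 2 / lambda.

```python
import numpy as np


def top_eigvecs(hessian, k):
    # np.linalg.eigh returns eigenvalues in ascending order,
    # so the last k columns span the k sharpest directions.
    eigvals, eigvecs = np.linalg.eigh(hessian)
    return eigvals[-k:], eigvecs[:, -k:]


def scaled_step(w, grad, sharp_dirs, lr, gamma):
    # Split the gradient into its projection onto the sharpest subspace
    # and the remainder, then scale only the sharp part of the step by gamma.
    g_sharp = sharp_dirs @ (sharp_dirs.T @ grad)
    g_rest = grad - g_sharp
    return w - lr * (gamma * g_sharp + g_rest)


# Toy quadratic loss 0.5 * w^T H w with a single very sharp direction.
H = np.diag([100.0, 1.0, 0.5])
w0 = np.array([1.0, 1.0, 1.0])
lr = 0.021  # slightly above 2 / 100, the stability limit along the sharp axis


def loss(w):
    return 0.5 * w @ H @ w


def run(gamma, steps=50):
    w = w0.copy()
    _, dirs = top_eigvecs(H, k=1)
    for _ in range(steps):
        grad = H @ w  # exact gradient of the quadratic (no mini-batch noise)
        w = scaled_step(w, grad, dirs, lr, gamma)
    return loss(w)


initial_loss = loss(w0)
loss_plain = run(gamma=1.0)    # ordinary gradient descent
loss_reduced = run(gamma=0.1)  # step along the sharpest direction scaled down

print(f"initial: {initial_loss:.3f}")
print(f"plain:   {loss_plain:.3e}")
print(f"reduced: {loss_reduced:.3e}")
```

In this toy setting the plain update oscillates with growing amplitude along the sharp axis, while the scaled update converges along it and keeps the full step size elsewhere. It says nothing about generalization, which the abstract's claims concern and which a deterministic quadratic cannot capture.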