Training large neural networks requires distributing learning across multiple workers. The rate-limiting step is often the communication of gradients from workers to the parameter server and back again. We present signSGD with majority vote: the first gradient compression scheme to achieve 1-bit compression of worker-server communication in both directions with non-vacuous theoretical guarantees. To achieve this, we build an extensive theory of sign-based optimisation, which is also relevant to understanding adaptive gradient methods like Adam and RMSprop. We prove that signSGD can get the best of both worlds: compressed gradients and an SGD-level convergence rate. signSGD can exploit mismatches between L1 and L2 geometry: when noise and curvature are much sparser than the gradients, signSGD is expected to converge at the same rate as, or faster than, full-precision SGD. Measurements of the L1 versus L2 geometry of real networks support our theoretical claims, and we find that the momentum counterpart of signSGD is able to match the accuracy and convergence speed of Adam on deep ImageNet models.
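To make the communication pattern concrete, below is a minimal NumPy sketch of one round of the scheme as described above: each worker transmits only the elementwise sign of its stochastic gradient (1 bit per parameter), the server aggregates these sign vectors by elementwise majority vote, and a single sign vector is broadcast back as the update direction. The function names (`worker_message`, `majority_vote`, `signsgd_step`), the learning rate, and the toy data are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def worker_message(stochastic_grad):
    """Worker-side compression: keep only the elementwise sign (+1/-1).

    Illustrative sketch; a real system would pack these signs into bits
    before sending them to the parameter server.
    """
    return np.sign(stochastic_grad)

def majority_vote(sign_messages):
    """Server-side aggregation: elementwise majority vote across workers.

    Summing the +1/-1 messages and taking the sign of the total gives the
    majority decision for each coordinate (ties map to 0 under np.sign).
    """
    return np.sign(np.sum(sign_messages, axis=0))

def signsgd_step(params, worker_grads, lr=1e-3):
    """One round of signSGD with majority vote (hypothetical helper)."""
    votes = [worker_message(g) for g in worker_grads]
    update_direction = majority_vote(votes)
    return params - lr * update_direction

# Toy usage: 5 workers with noisy gradients of a 4-parameter model.
rng = np.random.default_rng(0)
params = np.zeros(4)
worker_grads = [np.array([1.0, -2.0, 0.5, -0.1]) + rng.normal(size=4)
                for _ in range(5)]
params = signsgd_step(params, worker_grads, lr=0.01)
```

Both directions of communication carry only sign information: workers send sign vectors up, and the server sends a single sign vector down, which is the sense in which the scheme achieves 1-bit compression in both directions.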