ICLR Poster Toward Understanding the Impact of Staleness in Distributed Machine Learning

Wei Dai · Yi Zhou · Nanqing Dong · Hao Zhang · Eric Xing

Great Hall BC #52

Abstract:

Most distributed machine learning (ML) systems store a copy of the model parameters locally on each machine to minimize network communication. In practice, in order to reduce synchronization waiting time, these copies of the model are not necessarily updated in lock-step, and can become stale. Despite much development in large-scale ML, the effect of staleness on the learning efficiency is inconclusive, mainly because it is challenging to control or monitor the staleness in complex distributed environments. In this work, we study the convergence behaviors of a wide array of ML models and algorithms under delayed updates. Our extensive experiments reveal the rich diversity of the effects of staleness on the convergence of ML algorithms and offer insights into seemingly contradictory reports in the literature. The empirical findings also inspire a new convergence analysis of SGD in non-convex optimization under staleness, matching the best-known convergence rate of $O(1/\sqrt{T})$.
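To make "delayed updates" concrete, here is a minimal toy sketch. It is not the authors' experimental setup. It assumes a one-dimensional quadratic objective f(x) = 0.5 * x**2, deterministic gradients, and a fixed delay. The function name `delayed_gradient_descent` and its parameters are hypothetical, chosen only for illustration. Each step applies a gradient evaluated at the iterate from `staleness` steps earlier, which mimics a worker computing on an out-of-date copy of the parameters.

```python
def delayed_gradient_descent(staleness, steps=200, lr=0.1, x0=1.0):
    """Minimize f(x) = 0.5 * x**2 with gradients taken at stale iterates.

    Toy illustration only: fixed delay, no noise, one parameter.
    Returns the full trajectory [x_0, x_1, ..., x_steps].
    """
    history = [x0]
    for t in range(steps):
        # The gradient is evaluated on the parameter value from `staleness`
        # steps ago (clamped to the initial value early on).
        stale_x = history[max(0, t - staleness)]
        grad = stale_x  # derivative of 0.5 * x**2 is x
        history.append(history[-1] - lr * grad)
    return history


fresh = delayed_gradient_descent(staleness=0)      # ordinary gradient descent
moderate = delayed_gradient_descent(staleness=5)   # mildly stale gradients
large = delayed_gradient_descent(staleness=20)     # heavily stale gradients

print("final |x|, staleness 0: ", abs(fresh[-1]))
print("final |x|, staleness 5: ", abs(moderate[-1]))
print("max |x|,   staleness 20:", max(abs(v) for v in large))
```

In this toy setting, the run without delay and the run with a small delay both approach the minimum at zero, while the heavily delayed run overshoots past its starting distance from the minimum at the same learning rate. How staleness interacts with real models, stochastic gradients, and other algorithms is the question the paper studies empirically.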
