Efficient Distributed Optimization under Heavy-Tailed Noise
Su Lee · Manzil Zaheer · Tian Li
Abstract
Distributed optimization is essential for scaling modern machine learning, yet communication overhead remains a challenge. Local updates reduce this cost but introduce a nested optimization structure, in which heavy-tailed gradient noise, especially prevalent in attention-based models, impairs convergence. We propose TailOPT, a framework that leverages adaptive optimization and clipping to address heavy-tailed noise, with convergence guarantees under unbounded stochastic gradient variance and local updates. Among its variants, we introduce $Bi^2Clip$, which applies coordinate-wise clipping at both the inner and outer optimizers, achieving Adam-like adaptive performance without the overhead of maintaining or transmitting preconditioners. Empirically, TailOPT, including $Bi^2Clip$, outperforms state-of-the-art methods across multiple language tasks and models.
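The abstract describes clipping applied at two levels: each client clips its stochastic gradients coordinate-wise during local updates, and the server clips the aggregated pseudo-gradient coordinate-wise before applying it, with no preconditioner state kept or communicated. The sketch below is a minimal illustration of that structure, not the authors' implementation; the toy quadratic objective, the Student-t noise standing in for heavy-tailed minibatch gradients, and all thresholds and learning rates (`tau_in`, `tau_out`, `lr_in`, `lr_out`) are illustrative assumptions.

```python
# Minimal sketch of double coordinate-wise clipping with local updates,
# in the spirit of Bi^2Clip. All hyperparameters and the toy objective
# are assumptions for illustration only.
import numpy as np


def coord_clip(v, tau):
    """Clip each coordinate of v to the interval [-tau, tau]."""
    return np.clip(v, -tau, tau)


def local_steps(x, data, lr_in, tau_in, num_steps, rng):
    """Inner loop: SGD with coordinate-wise clipped stochastic gradients."""
    for _ in range(num_steps):
        # Toy stochastic gradient of 0.5 * ||x - data||^2 plus heavy-tailed
        # (Student-t, df=2) noise, standing in for minibatch gradients.
        grad = (x - data) + rng.standard_t(df=2, size=x.shape)
        x = x - lr_in * coord_clip(grad, tau_in)
    return x


def outer_round(x_global, client_data, lr_out, lr_in, tau_in, tau_out,
                num_local_steps, rng):
    """One communication round: clients run clipped local updates, then the
    server takes a coordinate-wise clipped step on the averaged update."""
    deltas = []
    for data in client_data:
        x_local = local_steps(x_global.copy(), data, lr_in, tau_in,
                              num_local_steps, rng)
        deltas.append(x_global - x_local)  # client pseudo-gradient
    avg_delta = np.mean(deltas, axis=0)
    # Outer clipping: no per-coordinate preconditioner is maintained or
    # transmitted, only the aggregated update is clipped.
    return x_global - lr_out * coord_clip(avg_delta, tau_out)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, num_clients = 10, 4
    client_data = [rng.normal(size=dim) for _ in range(num_clients)]
    x = np.zeros(dim)
    for _ in range(50):
        x = outer_round(x, client_data, lr_out=1.0, lr_in=0.1,
                        tau_in=1.0, tau_out=0.5,
                        num_local_steps=5, rng=rng)
    print("distance to mean of client optima:",
          np.linalg.norm(x - np.mean(client_data, axis=0)))
```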