Poster in Workshop: SCOPE: Scalable Optimization for Efficient and Adaptive Foundation Models
Efficient Distributed Optimization under Heavy-Tailed Noise
Su Lee · Manzil Zaheer · Tian Li
Keywords: [ Adaptive Optimization ] [ Distributed Optimization ] [ Scalable Algorithms ]
Abstract:
Distributed optimization is essential for scaling modern machine learning, yet communication overhead remains a challenge. Local updates reduce this cost but introduce a nested optimization structure in which heavy-tailed gradient noise, especially prevalent in attention-based models, impairs convergence. We propose TailOPT, a framework that leverages adaptive optimization and clipping to address heavy-tailed noise, with convergence guarantees under unbounded stochastic gradient variance and local updates. Among its variants, we introduce $Bi^2Clip$, which applies coordinate-wise clipping at both the inner and outer optimizers, achieving performance comparable to adaptive methods (e.g., Adam) without the overhead of maintaining or transmitting preconditioners. Empirically, TailOPT, including $Bi^2Clip$, outperforms state-of-the-art methods across multiple language tasks and models.
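To make the nested structure concrete, below is a minimal sketch (not the authors' implementation) of how coordinate-wise clipping at both levels could be wired into a local-update round: each client clips its stochastic gradients coordinate-wise during inner updates, and the server clips the averaged pseudo-gradient coordinate-wise before the outer step. All names and parameters here (coordwise_clip, lr_inner, inner_clip, lr_outer, outer_clip) are illustrative assumptions; the actual thresholds, schedules, and update rules of $Bi^2Clip$ are specified in the paper.

```python
# Hypothetical sketch of double coordinate-wise clipping in a local-update
# (federated-style) round. Not the paper's implementation.
import torch


def coordwise_clip(t: torch.Tensor, threshold: float) -> torch.Tensor:
    """Clip each coordinate of t to the interval [-threshold, threshold]."""
    return t.clamp(min=-threshold, max=threshold)


def local_round(init_params, grads_per_step, lr_inner, inner_clip):
    """Run local updates, clipping each stochastic gradient coordinate-wise (inner clipping).

    init_params: list of plain tensors (current global model).
    grads_per_step: one list of per-parameter gradients for each local step.
    Returns the client's pseudo-gradient (initial minus final local parameters).
    """
    params = [p.clone() for p in init_params]
    for grads in grads_per_step:
        for p, g in zip(params, grads):
            p -= lr_inner * coordwise_clip(g, inner_clip)
    return [p0 - p for p0, p in zip(init_params, params)]


def server_round(global_params, client_pseudo_grads, lr_outer, outer_clip):
    """Average client pseudo-gradients, clip coordinate-wise, and take the outer step."""
    avg = [torch.stack(group).mean(dim=0) for group in zip(*client_pseudo_grads)]
    return [w - lr_outer * coordwise_clip(d, outer_clip)
            for w, d in zip(global_params, avg)]
```

The design intuition, as described in the abstract, is that clipping on both sides tempers heavy-tailed noise much as an adaptive preconditioner would, while only model parameters (no second-moment state) need to be stored or communicated.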