Efficient Distributed Optimization under Heavy-Tailed Noise
Su Lee · Manzil Zaheer · Tian Li
Abstract
Distributed optimization is essential for scaling modern machine learning, yet communication overhead remains a challenge. Local updates reduce this cost but introduce a nested optimization structure, in which heavy-tailed gradient noise, especially prevalent in attention-based models, impairs convergence. We propose TailOPT, a framework that leverages adaptive optimization and clipping to address heavy-tailed noise, with convergence guarantees under unbounded stochastic gradient variance and local updates. Among its variants, we introduce $Bi^2Clip$, which applies coordinate-wise clipping at both the inner and outer optimizers, achieving Adam-like adaptive performance without the overhead of maintaining or transmitting preconditioners. Empirically, TailOPT, including $Bi^2Clip$, outperforms state-of-the-art methods across multiple language tasks and models.
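The abstract describes clipping applied at two levels: each client clips its stochastic gradients coordinate-wise during local updates, and the server clips the aggregated pseudo-gradient coordinate-wise before applying it, with no preconditioner state kept or communicated. The sketch below is a minimal illustration of that structure, not the authors' implementation; the toy quadratic objective, the Student-t noise standing in for heavy-tailed minibatch gradients, and all thresholds and learning rates (`tau_in`, `tau_out`, `lr_in`, `lr_out`) are illustrative assumptions.

```python
# Minimal sketch of double coordinate-wise clipping with local updates,
# in the spirit of Bi^2Clip. All hyperparameters and the toy objective
# are assumptions for illustration only.
import numpy as np


def coord_clip(v, tau):
    """Clip each coordinate of v to the interval [-tau, tau]."""
    return np.clip(v, -tau, tau)


def local_steps(x, data, lr_in, tau_in, num_steps, rng):
    """Inner loop: SGD with coordinate-wise clipped stochastic gradients."""
    for _ in range(num_steps):
        # Toy stochastic gradient of 0.5 * ||x - data||^2 plus heavy-tailed
        # (Student-t, df=2) noise, standing in for minibatch gradients.
        grad = (x - data) + rng.standard_t(df=2, size=x.shape)
        x = x - lr_in * coord_clip(grad, tau_in)
    return x


def outer_round(x_global, client_data, lr_out, lr_in, tau_in, tau_out,
                num_local_steps, rng):
    """One communication round: clients run clipped local updates, then the
    server takes a coordinate-wise clipped step on the averaged update."""
    deltas = []
    for data in client_data:
        x_local = local_steps(x_global.copy(), data, lr_in, tau_in,
                              num_local_steps, rng)
        deltas.append(x_global - x_local)  # client pseudo-gradient
    avg_delta = np.mean(deltas, axis=0)
    # Outer clipping: no per-coordinate preconditioner is maintained or
    # transmitted, only the aggregated update is clipped.
    return x_global - lr_out * coord_clip(avg_delta, tau_out)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, num_clients = 10, 4
    client_data = [rng.normal(size=dim) for _ in range(num_clients)]
    x = np.zeros(dim)
    for _ in range(50):
        x = outer_round(x, client_data, lr_out=1.0, lr_in=0.1,
                        tau_in=1.0, tau_out=0.5,
                        num_local_steps=5, rng=rng)
    print("distance to mean of client optima:",
          np.linalg.norm(x - np.mean(client_data, axis=0)))
```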