Poster in Workshop: Modular, Collaborative and Decentralized Deep Learning
Beyond Top-K: Structured Sparsification for Compression in Pipeline Parallel
Sameera Ramasinghe · Thalaiyasingam Ajanthan · Gil Avraham · Yan Zuo · Alexander Long
Abstract:
In decentralized training, efficient communication is critical, particularly when training large-scale models over low-bandwidth, heterogeneous networks. Although gradient compression techniques have proven effective in Distributed Data-Parallel (DDP) settings, extending them to pipeline parallel (PP) training is challenging because compression errors compound with network depth. In this work, we introduce a novel compression framework for PP that preserves the column space of activations and gradients instead of compressing individual elements. We derive tight theoretical error bounds and demonstrate the effectiveness of our method by training models over 80 Mbps connections, achieving up to 90% compression along with approximately $2\times$ training and $12\times$ inference throughput improvements.
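To make the contrast between element-wise Top-K and structured, column-space-preserving compression concrete, the sketch below compares the two on a random activation matrix. This is only an illustration under simplifying assumptions, not the authors' algorithm: the column selection rule (keeping whole columns by L2 norm), the tensor shapes, and the function names `topk_elementwise` / `topk_columns` are all hypothetical choices made for this example.

```python
# Illustrative sketch (not the paper's exact method): element-wise Top-K vs. a
# structured scheme that keeps entire columns, so the retained tensor's columns
# still lie in the original column space. Column-norm selection is an assumption.
import torch


def topk_elementwise(x: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Element-wise Top-K: zero out all but the largest-magnitude entries."""
    k = max(1, int(keep_ratio * x.numel()))
    flat = x.flatten()
    idx = flat.abs().topk(k).indices
    out = torch.zeros_like(flat)
    out[idx] = flat[idx]
    return out.view_as(x)


def topk_columns(x: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Structured sparsification: keep whole columns with the largest L2 norm."""
    n_cols = x.shape[1]
    k = max(1, int(keep_ratio * n_cols))
    idx = x.norm(dim=0).topk(k).indices
    out = torch.zeros_like(x)
    out[:, idx] = x[:, idx]
    return out


if __name__ == "__main__":
    torch.manual_seed(0)
    acts = torch.randn(512, 1024)  # hypothetical activations (hidden_dim x tokens)
    err_elem = (acts - topk_elementwise(acts, 0.1)).norm() / acts.norm()
    err_cols = (acts - topk_columns(acts, 0.1)).norm() / acts.norm()
    print(f"relative error, element-wise Top-K: {err_elem:.3f}")
    print(f"relative error, column-wise keep:   {err_cols:.3f}")
```

In a pipeline-parallel setting, the column-wise variant would be applied to the activations (and corresponding gradients) sent across stage boundaries; the paper's framework is motivated by the observation that preserving a subspace rather than scattered individual elements limits how errors accumulate across stages.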