

Poster

Efficient stagewise pretraining via progressive subnetworks

Abhishek Panigrahi · Nikunj Saunshi · Kaifeng Lyu · Sobhan Miryoosefi · Sashank J. Reddi · Satyen Kale · Sanjiv Kumar

Hall 3 + Hall 2B #584
Thu 24 Apr midnight PDT — 2:30 a.m. PDT

Abstract:

Recent developments in large language models have sparked interest in efficient pretraining methods. Stagewise training approaches to improving efficiency, like gradual stacking and layer dropping (Reddi et al., 2023; Zhang & He, 2020), have recently garnered attention. The prevailing view suggests that stagewise dropping strategies, such as layer dropping, are ineffective, especially when compared to stacking-based approaches. This paper challenges this notion by demonstrating that, with proper design, dropping strategies can be competitive with, if not better than, stacking methods. Specifically, we develop a principled stagewise training framework, progressive subnetwork training, which only trains subnetworks within the model and progressively increases the size of the subnetworks during training, until it trains the full network. We propose an instantiation of this framework, Random Part Training (RAPTR), which selects and trains only a random subnetwork (e.g. depth-wise, width-wise) of the network at each step, progressively increasing the size in stages. We show that this approach not only generalizes prior works like layer dropping but also fixes their key issues. Furthermore, we establish a theoretical basis for such approaches and provide justification for (a) increasing the complexity of subnetworks in stages, conceptually diverging from prior works on layer dropping, and (b) stability in loss across stage transitions in the presence of key modern architecture components like residual connections and layer norms. Through comprehensive experiments, we demonstrate that RAPTR can significantly speed up training on standard benchmarks like BERT and UL2, by up to 33% compared to standard training, and, surprisingly, also shows better downstream performance on UL2, improving QA tasks and SuperGLUE by 1.5%, thereby providing evidence of better inductive bias.
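The depth-wise variant of the idea can be sketched roughly as follows: at each step, only a random subset of layers is trained, skipped layers fall back to the identity via the residual connection, and the subnetwork size grows stage by stage until the full network is trained. This is a minimal illustrative sketch in PyTorch under those assumptions; the block definition, stage schedule, and loss are placeholders, not the authors' implementation.

```python
import random
import torch
import torch.nn as nn

class SimpleBlock(nn.Module):
    """Toy residual block standing in for a transformer layer."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        # Residual connection: a skipped block reduces to the identity map,
        # which helps keep the loss stable across stage transitions.
        return x + self.ff(self.norm(x))

class RaptrStack(nn.Module):
    """Stack of blocks; each step trains only a random depth-wise subnetwork."""
    def __init__(self, dim, num_layers):
        super().__init__()
        self.blocks = nn.ModuleList(SimpleBlock(dim) for _ in range(num_layers))

    def forward(self, x, active):
        # Sample `active` layers uniformly at random and skip the rest.
        chosen = set(random.sample(range(len(self.blocks)), active))
        for i, block in enumerate(self.blocks):
            if i in chosen:
                x = block(x)
        return x

# Progressive stages: (training steps, subnetwork size), growing to full depth.
# The step counts and sizes below are placeholders, not the paper's schedule.
model = RaptrStack(dim=64, num_layers=8)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
for steps, active in [(100, 4), (100, 6), (200, 8)]:
    for _ in range(steps):
        x = torch.randn(16, 32, 64)            # dummy batch (batch, seq, dim)
        loss = model(x, active).pow(2).mean()  # placeholder loss
        opt.zero_grad()
        loss.backward()
        opt.step()
```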
