Poster
Efficient stagewise pretraining via progressive subnetworks
Abhishek Panigrahi · Nikunj Saunshi · Kaifeng Lyu · Sobhan Miryoosefi · Sashank J. Reddi · Satyen Kale · Sanjiv Kumar
Hall 3 + Hall 2B #584
Recent developments in large language models have sparked interest in efficient pretraining methods. Stagewise training approaches to improve efficiency, like gradual stacking and layer dropping (Reddi et al., 2023; Zhang & He, 2020), have recently garnered attention. The prevailing view suggests that stagewise dropping strategies, such as layer dropping, are ineffective, especially when compared to stacking-based approaches. This paper challenges this notion by demonstrating that, with proper design, dropping strategies can be competitive with, if not better than, stacking methods. Specifically, we develop a principled stagewise training framework, progressive subnetwork training, which only trains subnetworks within the model and progressively increases the size of subnetworks during training, until it trains the full network. We propose an instantiation of this framework, Random Part Training (RAPTR), that selects and trains only a random subnetwork (e.g., depth-wise, width-wise) of the network at each step, progressively increasing the size in stages. We show that this approach not only generalizes prior works like layer dropping but also fixes their key issues. Furthermore, we establish a theoretical basis for such approaches and provide justification for (a) increasing the complexity of subnetworks in stages, conceptually diverging from prior works on layer dropping, and (b) stability in loss across stage transitions in the presence of key modern architecture components like residual connections and layer norms. Through comprehensive experiments, we demonstrate that RAPTR can significantly speed up training on standard benchmarks like BERT and UL2, by up to 33% compared to standard training, and, surprisingly, also shows better downstream performance on UL2, improving QA tasks and SuperGLUE by 1.5%, thereby providing evidence of better inductive bias.
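To make the depth-wise instantiation concrete, below is a minimal sketch, assuming a stack of pre-LayerNorm residual blocks: at each step only a random subset of layers is applied (the residual path passes skipped layers through unchanged), and a stagewise schedule grows the subnetwork until the full depth is trained. The names ProgressiveSubnetModel and active_depth_schedule, and all hyperparameters shown, are hypothetical illustrations and not the authors' implementation.

```python
# Hypothetical sketch of depth-wise progressive subnetwork training (not the authors' code).
import random
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Toy pre-LayerNorm residual block standing in for a transformer layer."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        # Residual connection lets a skipped block act as the identity.
        return x + self.ffn(self.norm(x))

class ProgressiveSubnetModel(nn.Module):
    def __init__(self, dim, num_layers):
        super().__init__()
        self.layers = nn.ModuleList(ResidualBlock(dim) for _ in range(num_layers))

    def forward(self, x, active_depth=None):
        """Apply a random subnetwork of `active_depth` layers; None means the full network."""
        indices = range(len(self.layers))
        if active_depth is not None and active_depth < len(self.layers):
            indices = sorted(random.sample(indices, active_depth))
        for i in indices:
            x = self.layers[i](x)
        return x

def active_depth_schedule(step, total_steps, num_layers, num_stages=4):
    """Stagewise schedule: the trained subnetwork grows from roughly half depth to full depth."""
    stage = min(num_stages - 1, step * num_stages // total_steps)
    min_depth = max(1, num_layers // 2)
    return min(num_layers, min_depth + stage * (num_layers - min_depth) // (num_stages - 1))

# Usage sketch: one toy training loop with a stage-dependent subnetwork depth.
model = ProgressiveSubnetModel(dim=64, num_layers=12)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
x, target = torch.randn(8, 16, 64), torch.randn(8, 16, 64)
for step in range(100):
    depth = active_depth_schedule(step, total_steps=100, num_layers=12)
    loss = nn.functional.mse_loss(model(x, active_depth=depth), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because only the sampled layers run forward and backward in early stages, each step is cheaper than full-network training, which is the source of the reported wall-clock savings.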