Poster
in
Workshop: SCOPE: SCALABLE OPTIMIZATION FOR EFFICIENT AND ADAPTIVE FOUNDATION MODELS
OPPA: OPtimizing PArallelism for Language Model Training
Apivich Hemachandra · Yizhan Han · See-Kiong Ng · Bryan Kian Hsiang Low
Keywords: [ parallelized training ] [ bayesian optimization ] [ neural network training ]
Training of modern large neural networks (NNs) is often done in parallel across multiple GPUs. While existing parallel training frameworks make it easy to train NNs with multi-dimensional parallelism, the challenge remains in balancing the sizes of the parallelism dimensions and in tuning the hyperparameters within each parallelism dimension. Due to the large number of possible parallelism configurations (PCs) for a given training process, exhaustive search over all candidates is infeasible. Existing PC optimization methods either rely on an approximate cost model, which may be inaccurate and hardware-specific, or require a large number of NN training trials on different PCs, each of which is expensive to evaluate. To overcome these issues, we present OPPA, which combines Bayesian optimization with prior knowledge in the form of a parallelism-informed prior belief to obtain an optimal PC using a minimal number of NN training trials. We demonstrate that OPPA finds an optimal PC for training transformers more efficiently than the methods used in existing parallel training frameworks.
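To make the idea concrete, the sketch below shows one way Bayesian optimization over a discrete set of parallelism configurations could be warm-started with a hand-crafted prior score. This is an illustrative sketch only, not the authors' implementation: the trial function `measure_step_time` and the prior `prior_score` are hypothetical placeholders, and a simulated timing model stands in for real training runs.

```python
# Hypothetical sketch: Bayesian optimization over parallelism configurations (PCs),
# warm-started by a "parallelism-informed" prior. Placeholder functions only.
import itertools
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

N_GPUS = 8
# Candidate PCs: (data-, tensor-, pipeline-parallel) degrees whose product uses all GPUs.
candidates = [(d, t, p) for d, t, p in itertools.product([1, 2, 4, 8], repeat=3)
              if d * t * p == N_GPUS]
X = np.log2(np.array(candidates, dtype=float))  # simple numeric encoding of PCs

def measure_step_time(cfg):
    """Stand-in for an expensive trial: run a short training job under the given
    PC and return measured time per step (simulated here for illustration)."""
    d, t, p = cfg
    return 1.0 / d + 0.3 * t + 0.2 * p + 0.05 * np.random.rand()

def prior_score(cfg):
    """Toy parallelism-informed prior: prefer smaller tensor/pipeline degrees
    (purely illustrative; not the prior used in OPPA)."""
    _, t, p = cfg
    return -(np.log2(t) + np.log2(p))

# Warm-start with the PCs ranked highest by the prior.
order = sorted(range(len(candidates)), key=lambda i: -prior_score(candidates[i]))
evaluated = list(order[:2])
times = [measure_step_time(candidates[i]) for i in evaluated]

for _ in range(6):  # small budget of training trials
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X[evaluated], np.array(times))
    mu, sigma = gp.predict(X, return_std=True)
    best = min(times)
    # Expected improvement for minimization, masking already-evaluated PCs.
    z = (best - mu) / np.maximum(sigma, 1e-9)
    ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
    ei[evaluated] = -np.inf
    nxt = int(np.argmax(ei))
    evaluated.append(nxt)
    times.append(measure_step_time(candidates[nxt]))

print("best PC found:", candidates[evaluated[int(np.argmin(times))]])
```

The prior here only biases which configurations are tried first; the actual OPPA prior and acquisition strategy are described in the paper.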