Invited Talk in Workshop: Navigating and Addressing Data Problems for Foundation Models (DPFM)
Invited Talk #5 - Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models
Speaker: Luke Zettlemoyer (University of Washington / Meta)
Abstract: Existing language model (LM) training regimes entangle compute, data, and parameters, requiring expensive synchronous communication with massive supercomputers. This talk introduces a new algorithm called Branch-Train-Merge (BTM) that asynchronously trains LMs that are fundamentally modular. In BTM, components (or experts) of the LM are specialized to distinct domains in the training corpus, and experts are conditionally updated based on the domain of the incoming document. We show how BTM enables LMs that are rapidly customizable (with the ability to mix, add, or remove experts after training), embarrassingly parallel (requiring no communication between experts), and sparse (needing only a few experts active at a time for inference). Key to our proposal is exploring what constitutes the domains to which experts specialize, as well as reflecting on the data sources used to train LMs. Our new techniques chart a path towards collaborative and iterative LM development, where anyone can contribute and maintain experts at modest computational cost.
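A minimal sketch of the branch/train/merge loop described in the abstract, assuming PyTorch. The toy seed model, the domain names, and the fixed domain weights are hypothetical stand-ins for illustration, not the authors' implementation: it only shows branching a seed LM per domain, training each copy independently with no cross-expert communication, and merging a sparse top-k subset of experts at inference by weighted-averaging their next-token distributions.

import copy
import torch
import torch.nn as nn

VOCAB = 100   # toy vocabulary size
CTX = 8       # toy context length

def make_seed_lm():
    # Toy stand-in for a shared "seed" LM: embed the context, flatten, predict the next token.
    return nn.Sequential(nn.Embedding(VOCAB, 32), nn.Flatten(1), nn.Linear(32 * CTX, VOCAB))

def branch_train(seed_lm, domain_corpora, steps=10):
    # Branch: copy the seed LM once per domain. Train: update each copy on its own
    # domain only, so experts never communicate (embarrassingly parallel).
    experts = {}
    for domain, (contexts, targets) in domain_corpora.items():
        expert = copy.deepcopy(seed_lm)
        opt = torch.optim.Adam(expert.parameters(), lr=1e-3)
        for _ in range(steps):
            loss = nn.functional.cross_entropy(expert(contexts), targets)
            opt.zero_grad()
            loss.backward()
            opt.step()
        experts[domain] = expert
    return experts

def merge_predict(experts, context, domain_weights, top_k=2):
    # Merge: keep only the top-k experts active (sparse inference) and average their
    # next-token distributions, weighted by the supplied per-domain weights.
    chosen = sorted(domain_weights, key=domain_weights.get, reverse=True)[:top_k]
    total = sum(domain_weights[d] for d in chosen)
    probs = 0.0
    with torch.no_grad():
        for d in chosen:
            probs = probs + (domain_weights[d] / total) * torch.softmax(experts[d](context), dim=-1)
    return probs

# Usage with random toy data for two hypothetical domains.
corpora = {d: (torch.randint(0, VOCAB, (16, CTX)), torch.randint(0, VOCAB, (16,)))
           for d in ["news", "code"]}
experts = branch_train(make_seed_lm(), corpora)
next_token_probs = merge_predict(experts, torch.randint(0, VOCAB, (1, CTX)),
                                 domain_weights={"news": 0.7, "code": 0.3})

Mixing, adding, or removing an expert after training amounts to editing the experts dictionary, without retraining the others; this is the post-hoc customizability the abstract highlights.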
This talk describes work done at the University of Washington and Meta, primarily led by Suchin Gururangan and Margaret Li, along with many other collaborators.