Invited Talk in Workshop: Navigating and Addressing Data Problems for Foundation Models (DPFM)
Invited Talk #1 - Bridging the Gap Between Pre-training Data and Alignment [Speaker: Mike Lewis (Meta AI)]
Title: Bridging the Gap Between Pre-training Data and Alignment
Intro: Large Language Models are first pre-trained to predict unlabelled web documents, but are then typically used in scenarios such as few-shot learning and instruction following, leaving a significant gap between training and inference. I will describe two scalable approaches to reducing this gap by augmenting pre-training data with additional structure. First, I will introduce In-Context Pre-training, which mimics in-context learning during pre-training by training language models on sequences of closely related documents. Then, I will describe Instruction Backtranslation, an approach that augments web documents with automatically inferred instructions for generating them, providing a highly scalable source of instruction-following data. Together, these approaches can improve the utility of base models and reduce the need for a separate alignment phase of training.
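The sketch below is a minimal illustration (not the speaker's implementation) of the two data-construction ideas summarised in the abstract: grouping closely related documents into a single pre-training sequence, and inferring an instruction for which an existing web document would be a plausible response. The function names, the similarity measure, and the instruction-inference model are placeholder assumptions for illustration only.

```python
# Illustrative sketch of the two data-augmentation ideas from the abstract.
# All names and the similarity/instruction models are hypothetical placeholders.

from typing import Callable, List, Sequence


def build_in_context_pretraining_sequences(
    documents: List[str],
    related: Callable[[str, str], float],
    docs_per_sequence: int = 4,
) -> List[str]:
    """In-Context Pre-training (sketch): greedily chain each document with its
    most related unused neighbour, then concatenate the chain into a single
    training sequence so the model sees related context during pre-training."""
    remaining = list(documents)
    sequences = []
    while remaining:
        chain = [remaining.pop(0)]
        while remaining and len(chain) < docs_per_sequence:
            # Pick the remaining document most related to the last one in the chain.
            best = max(remaining, key=lambda d: related(chain[-1], d))
            remaining.remove(best)
            chain.append(best)
        sequences.append("\n\n".join(chain))
    return sequences


def instruction_backtranslate(
    documents: Sequence[str],
    infer_instruction: Callable[[str], str],
) -> List[dict]:
    """Instruction Backtranslation (sketch): treat each web document as a
    response and use a model to infer the instruction that could have
    produced it, yielding (instruction, response) pairs."""
    return [
        {"instruction": infer_instruction(doc), "response": doc}
        for doc in documents
    ]


if __name__ == "__main__":
    docs = ["How to bake bread ...", "Sourdough starter tips ...", "Intro to Rust ..."]
    # Toy similarity based on word overlap; a real system would use embeddings.
    sim = lambda a, b: float(len(set(a.lower().split()) & set(b.lower().split())))
    print(build_in_context_pretraining_sequences(docs, sim, docs_per_sequence=2))
    print(instruction_backtranslate(docs, lambda d: f"Write a short article that begins: {d[:20]}..."))
```

A production pipeline would retrieve related documents with embedding-based search at web scale and filter the generated instruction-response pairs for quality; both steps are omitted from this sketch.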
Bio: Mike Lewis is a research scientist at Meta AI and is currently the pre-training lead for Llama 3. Prior projects include the Cicero Diplomacy agent and the BART and RoBERTa pretrained language models. Previously, he was a postdoc at the University of Washington (working with Luke Zettlemoyer), and he holds a PhD from the University of Edinburgh (advised by Mark Steedman). He received a Best Paper Award at EMNLP 2016, a Best Resource Paper Award at ACL 2017, and a Best Paper Honourable Mention at ACL 2018. His work has been extensively covered in the media, with varying levels of accuracy.