Toward Verifiably Steerable Language Models
Abstract
Post-training is key to making large language models useful: it shapes how models respond to instructions, align with human intent, and generalize across diverse tasks. This talk addresses the challenge of developing steerable AI through post-training. I will discuss how we can train models to be better instruction followers, and I will show that most models severely overfit to a small set of instruction-following constraints and generalize poorly to unseen output constraints. I propose training models with reinforcement learning from verifiable rewards for verifiable instruction following, and show how this improves generalization in constraint following. Throughout the presentation, I will outline how I have applied these insights to developing open generative models such as Tülu and OLMo, and I will conclude with an outlook on how we can make AI more steerable in the future.
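To make the idea of a verifiable reward concrete, here is a minimal sketch in Python of how output constraints might be checked programmatically and turned into a scalar reward. Everything in it is an illustrative assumption rather than the method used in the talk: the checker names (`max_words`, `contains_keyword`, `bullet_count`), the `verifiable_reward` function, and the choice to score a response by the fraction of constraints it satisfies are all hypothetical.

```python
import re
from typing import Callable, List

# A checker takes a model response and returns True if the constraint holds.
Checker = Callable[[str], bool]


def max_words(limit: int) -> Checker:
    """Hypothetical constraint: the response has at most `limit` whitespace-separated words."""
    return lambda response: len(response.split()) <= limit


def contains_keyword(keyword: str) -> Checker:
    """Hypothetical constraint: the response mentions `keyword` (case-insensitive)."""
    return lambda response: keyword.lower() in response.lower()


def bullet_count(n: int) -> Checker:
    """Hypothetical constraint: the response has exactly `n` lines starting with '-' or '*'."""
    def check(response: str) -> bool:
        bullets = [
            line for line in response.splitlines()
            if re.match(r"\s*[-*]\s+", line)
        ]
        return len(bullets) == n
    return check


def verifiable_reward(response: str, checkers: List[Checker]) -> float:
    """Assumed reward shape: fraction of constraints the response satisfies (0.0 to 1.0)."""
    if not checkers:
        return 0.0
    return sum(check(response) for check in checkers) / len(checkers)


if __name__ == "__main__":
    response = "- Tulu is open\n- OLMo is open"
    constraints = [max_words(10), contains_keyword("olmo"), bullet_count(3)]
    # Two of the three constraints hold (the response has two bullets, not three).
    print(verifiable_reward(response, constraints))
```

Because each checker is a deterministic program and not a learned judge, the reward can be computed exactly for any response. Under this sketch, testing generalization would amount to evaluating with checkers that were held out during training.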