Invited Talk 2: Natasha Jaques (LLM Safety is a Multi-agent Problem)
Abstract
Online invited talk from Natasha Jaques
Although over 1 billion people currently use LLMs on a weekly basis, we still have no guarantees that these models are actually safe. They can tell us how to build a bomb or a bioweapon, say something inappropriate, or even try to persuade or manipulate their users. I will argue that all of these issues arise because current safety training paradigms treat the user as a static environment, when in fact users are dynamic, non-stationary, best-responding agents. In short, a multi-agent framework is necessary to improve model safety. I’ll discuss recent work in which we use online multi-agent reinforcement learning to develop a red-teaming procedure that can produce a provably safe LLM and that leads to large empirical safety gains in practice. Further, we will show an analysis of how LLMs distort human writing and communication, and I will argue this also results from training with RL from human feedback while failing to consider the user as a dynamic agent. Methods that are capable of adapting online to the user may provide a path for mediating these effects. Taken together, these results build the case for a multi-agent treatment of AI safety.