Oral in Workshop: How Far Are We From AGI
AI Alignment with Changing and Influenceable Reward Functions
Micah Carroll · Davis Foote · Anand Siththaranjan · Stuart Russell · Anca Dragan
Keywords: [ AI alignment ] [ Changing Preferences ]
Current AI alignment techniques treat human preferences as static and model them via a single reward function. However, our preferences change, making the goal of alignment ambiguous: should AI systems act in the interest of our current, past, or future selves? The behavior of AI systems may also influence our preferences, meaning that notions of alignment must also specify which kinds of influence are, and are not, acceptable. The answers to these questions are left undetermined by the current AI alignment paradigm, making it ill-posed. To ground formal discussions of these issues, we introduce Dynamic Reward MDPs (DR-MDPs), which extend MDPs by allowing the reward function to change over time and to be influenced by the agent. Using the lens of DR-MDPs, we demonstrate that agents trained with current alignment techniques will have incentives for influence: that is, they will systematically attempt to shift our future preferences to make them easier to satisfy. We also investigate how undesirable influence may be avoided by adjusting the agent's optimization horizon or by using alternative DR-MDP optimization objectives that correspond to different notions of alignment. Broadly, our work highlights the unintended consequences of applying current alignment techniques to settings with changing and influenceable preferences, and describes the challenges that must be overcome to develop a more general AI alignment paradigm that can accommodate such settings.
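The abstract describes DR-MDPs and influence incentives only at a high level. As a rough illustration of the phenomenon (not the paper's formalization; the dynamics, reward shapes, and numbers below are our own assumptions for a toy example), the following Python sketch models a preference parameter theta that drifts toward whatever the agent does. A policy that deliberately shifts theta toward content that is "easier to satisfy" earns higher long-horizon return than a policy that serves current preferences, which is the kind of influence incentive the abstract warns about.

```python
# Toy, self-contained sketch of an influence incentive in a DR-MDP-like setting.
# All names and dynamics here are illustrative assumptions.

GAMMA = 0.99  # discount factor


def reward(theta, action):
    """Reward under the user's *current* preference parameter theta in [0, 1].

    theta is the user's taste for content type 1; type 1 is assumed to be
    more rewarding once the user has been brought to like it.
    """
    return 2.0 * theta if action == 1 else 1.0 - theta


def preference_dynamics(theta, action, drift=0.1):
    """The reward parameter is influenceable: showing a content type nudges
    the user's taste toward that content."""
    return (1.0 - drift) * theta + drift * float(action)


def rollout_return(policy, theta0=0.2, horizon=50):
    """Discounted return of a policy, evaluated against the user's changing
    preferences (one possible DR-MDP optimization objective)."""
    theta, total = theta0, 0.0
    for t in range(horizon):
        a = policy(theta)
        total += (GAMMA ** t) * reward(theta, a)
        theta = preference_dynamics(theta, a)
    return total


def myopic(theta):
    # Greedily serves the user's current preferences.
    return 1 if reward(theta, 1) > reward(theta, 0) else 0


def influencer(theta):
    # Always shows content type 1, steadily shifting preferences toward it.
    return 1


print(f"myopic return:     {rollout_return(myopic):.2f}")
print(f"influencer return: {rollout_return(influencer):.2f}")
```

Under these toy assumptions the influencing policy outperforms the myopic one, even though it initially serves the user poorly: long-horizon optimization of cumulative reward makes shifting the (influenceable) preference parameter instrumentally valuable.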