PAMDP: Interact to Persona Alignment via a Partially Observable Markov Decision Process
Abstract
Comprehending user-specific nuances through interaction and adapting to user preferences is a pivotal capability for Persona Large Language Models, as it mirrors genuine dialogue dynamics more authentically than adherence to general human value alignment. In this paper, we formulate this ``Interact to Persona Alignment'' challenge as a Partially Observable Markov Decision Process, abbreviated as PAMDP, in which the user's profile, which evolves dynamically through interaction, is treated as a variable unobservable to the assistant. Grounded in this formulation, we propose a dual-critic reinforcement learning framework in which the assistant's utterance is represented as an action in a continuous latent space. We evaluate our approach on both offline datasets and an online simulator, demonstrating its effectiveness.
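For reference, the formulation can be written as a standard POMDP tuple $(\mathcal{S}, \mathcal{A}, T, R, \Omega, O, \gamma)$; the mapping of components to the dialogue setting below is an illustrative sketch inferred from the abstract, not the paper's exact notation.

```latex
% Illustrative mapping of POMDP components to the persona-alignment
% setting (an assumption based on the abstract, not the paper's notation):
\begin{align*}
\mathcal{S} &: \text{dialogue history paired with the user's latent, evolving profile} \\
\mathcal{A} &: \text{assistant utterances, encoded as points in a continuous latent space} \\
T(s' \mid s, a) &: \text{how the dialogue and user profile evolve after a reply} \\
R(s, a) &: \text{reward for persona-aligned assistant behavior} \\
\Omega,\ O(o \mid s', a) &: \text{user messages observed by the assistant (the profile itself is hidden)} \\
\gamma &: \text{discount factor over dialogue turns}
\end{align*}
```

Because the user's profile enters only through the state and is never directly observed, the assistant must infer it from the observation history, which is what distinguishes this setting from a fully observed MDP.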