


Abstract:

Developing dialogue systems capable of engaging in multi-turn, goal-oriented conversations remains a significant challenge, especially in specialized domains with limited data. This work proposes Preference Tree Optimization (PTO), a framework that iteratively improves agent models in such dialogue systems by generating preference data with a method called Preference Tree with Look-Ahead. Focusing on Motivational Interviewing (MI), a counseling technique aimed at facilitating behavioral change, we leverage virtual patients and an oracle evaluator to simulate conversations and generate rich preference datasets. Combining this method with Direct Preference Optimization (DPO) enhances the agent's decision-making capabilities over successive training cycles. The proposed framework addresses data scarcity and advances the development of more nuanced and effective dialogue systems in goal-oriented domains.

Experimental evaluations demonstrate that PTO improves dialogue agents' performance in goal-oriented conversations within the MI domain. Models trained with PTO consistently outperform the baseline on key metrics such as session satisfaction and working alliance. Incorporating look-ahead simulations further improves long-term planning and yields more effective conversational strategies, with deeper look-ahead configurations producing the most stable and highest-scoring results.
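To make the data-generation idea concrete, the following is a minimal, hypothetical sketch of one PTO-style iteration: at each decision point the agent proposes several candidate counselor turns, each candidate is rolled forward for a fixed look-ahead depth against a simulated patient, an oracle evaluator scores the resulting dialogue, and the best- and worst-scoring candidates form a preference pair for DPO training. The functions agent_propose, patient_reply, and oracle_score are stand-ins invented for illustration, not the authors' implementation.

```python
import random
from dataclasses import dataclass

random.seed(0)

def agent_propose(history, n_candidates=3):
    """Stand-in for the agent model: sample several candidate counselor turns."""
    return [f"counselor option {i} after {len(history)} turns" for i in range(n_candidates)]

def patient_reply(history):
    """Stand-in for the virtual patient simulator."""
    return f"patient reply after {len(history)} turns"

def oracle_score(history):
    """Stand-in for the oracle evaluator (e.g., session satisfaction / working alliance)."""
    return random.random()

def look_ahead_value(history, depth):
    """Roll the dialogue forward `depth` agent-patient exchanges and score the outcome."""
    h = list(history)
    for _ in range(depth):
        h.append(agent_propose(h, n_candidates=1)[0])
        h.append(patient_reply(h))
    return oracle_score(h)

@dataclass
class PreferencePair:
    prompt: list      # dialogue history up to the decision point
    chosen: str       # candidate with the best look-ahead value
    rejected: str     # candidate with the worst look-ahead value

def build_preference_pairs(num_sessions=2, turns_per_session=3, depth=2):
    pairs = []
    for _ in range(num_sessions):
        history = ["patient opening statement"]
        for _ in range(turns_per_session):
            candidates = agent_propose(history)
            scored = sorted(
                ((look_ahead_value(history + [c], depth), c) for c in candidates),
                reverse=True,
            )
            pairs.append(PreferencePair(list(history), scored[0][1], scored[-1][1]))
            # Continue the session along the preferred branch.
            history.append(scored[0][1])
            history.append(patient_reply(history))
    return pairs

if __name__ == "__main__":
    for p in build_preference_pairs():
        print(p.chosen, ">", p.rejected)
```

Under these assumptions, the resulting (prompt, chosen, rejected) triples would be fed to a standard DPO trainer, and the generate-then-train loop repeated for several iterations.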
