

Poster

Towards Understanding Sycophancy in Language Models

Mrinank Sharma · Meg Tong · Tomek Korbak · David Duvenaud · Amanda Askell · Sam Bowman · Esin Durmus · Zac Hatfield-Dodds · Scott Johnston · Shauna Kravec · Timothy Maxwell · Sam McCandlish · Kamal Ndousse · Oliver Rausch · Nicholas Schiefer · Da Yan · Miranda Zhang · Ethan Perez

Halle B #129

Abstract:

Reinforcement learning from human feedback (RLHF) is a popular technique for training high-quality AI assistants. However, RLHF may also encourage model responses that match user beliefs over truthful responses, a behavior known as sycophancy. We investigate the prevalence of sycophancy in RLHF-trained models and whether human preference judgments are responsible. We first demonstrate that five state-of-the-art AI assistants consistently exhibit sycophancy across four varied free-form text-generation tasks. To understand whether human preferences drive this broadly observed behavior of RLHF models, we analyze existing human preference data. We find that when a response matches a user's views, it is more likely to be preferred. Moreover, both humans and preference models (PMs) prefer convincingly-written sycophantic responses over correct ones a non-negligible fraction of the time. Optimizing model outputs against PMs also sometimes sacrifices truthfulness in favor of sycophancy. Overall, our results indicate that sycophancy is a general behavior of RLHF models, likely driven in part by human preference judgments favoring sycophantic responses.
