ICLR Poster Causally Motivated Sycophancy Mitigation for Large Language Models

Poster

Causally Motivated Sycophancy Mitigation for Large Language Models

Haoxi Li · Xueyang Tang · Jie ZHANG · Song Guo · Sikai Bai · Peiran Dong

Hall 3 + Hall 2B #303

[ Abstract ]

Fri 25 Apr 7 p.m. PDT — 9:30 p.m. PDT

Abstract:

Incorporating user preferences into large language models (LLMs) can enhance the personalization and reliability of model outputs and facilitate the application of LLMs to real-world scenarios. However, leveraging user preferences can be a double-edged sword. Recent studies have found that improper utilization can incur sycophancy, where LLMs prioritize alignment with user preferences over the correctness of their outputs. To address sycophancy in LLMs, we analyze and model the problem through the lens of structured causal models (SCMs). We attribute sycophancy to LLMs' reliance on spurious correlations between user preferences and model outputs in this paper. Based on the proposed SCMs, we develop a novel framework, termed CAUSM, to mitigate sycophancy in LLMs by exploiting a significant causal signature. Specifically, we eliminate the spurious correlations embedded in the intermediate layers of LLMs through causally motivated head reweighting, and then calibrate the intra-head knowledge along the causal representation direction. Extensive experiments are conducted across diverse language tasks to demonstrate the superiority of our method over state-of-the-art competitors in mitigating sycophancy in LLMs.

Live content is unavailable. Log in and register to view live content