Skip to yearly menu bar Skip to main content


Poster
in
Workshop: Workshop on Reasoning and Planning for Large Language Models

Chain-of-Thought Reasoning in the Wild is not Always Faithful

Iván Arcuschin · Jett Janiak · Robert Krzyzanowski · Senthooran Rajamanoharan · Neel Nanda · Arthur Conmy


Abstract:

Chain-of-Thought (CoT) reasoning has significantly advanced state-of-the-art AI capabilities. However, recent studies have shown that CoT reasoning is not always faithful, i.e. CoT reasoning does not always reflect how models arrive at conclusions. So far most of these studies have focused on unfaithfulness in unnatural contexts where an explicit bias has been introduced. In contrast, this work focuses on showing that unfaithful CoT can occur on realistic prompts with no artificial bias. We show evidence for several forms of unfaithful CoT in production models such as Claude 3.5 Sonnet, GPT-4o, and Gemini 2.0 Flash Thinking. Specifically, we find that models rationalize their implicit biases in answers to binary questions ("implicit post-hoc rationalization"). For example, when given "Is X bigger than Y?" or "Is Y bigger than X?", models sometimes come up with superficially coherent arguments for why both answers are yes or no, despite this being logically impossible. However, we can only identify this unfaithfulness by examining the rollouts on both questions, as we cannot distinguish unfaithfulness from a model being faithful but wrong when monitoring a single rollout. We also investigate "restoration errors", where models make and then silently correct errors in their reasoning, which explain some, but not all unfaithful CoT. Overall, our findings suggest that CoT provides at best an incomplete picture of a model's reasoning process, even on naturally worded prompts. This raises challenges for AI safety work relying on monitoring CoT, to detect undesired behavior.

Chat is not available.