Poster in Workshop: Representational Alignment

Probing and Steering Chain-of-Thought Unfaithfulness in Language Models

Giovanni Occhipinti ⋅ Alessandro Abate ⋅ Nandi Schoots

Abstract

Chain-of-Thought (CoT) explanations are essential for the monitoring and safety of Large Language Models (LLMs), yet they are susceptible to unfaithful rationalization that could obfuscate dangerous behaviors. While prior work has focused on black-box methods, the internal mechanisms and white-box control of faithfulness remain under-explored. In this paper, we employ representation engineering to investigate the latent geometry of faithfulness in thinking language models. We demonstrate that instances of faithfulness are, to some extent, encoded as linear directions in middle-to-late layers, as shown by successful probing and steering of the models’ internals. We find that linear steering interventions achieve monitorability recovery rates up to 46% with collateral effects below 5%. We also find that off-policy steering methods have comparable utility to on-policy approaches.
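
To make "probing and steering along a linear direction" concrete, the following is a minimal sketch on synthetic data. It is not the authors' implementation: the difference-of-means direction is one common representation-engineering choice, assumed here for illustration, and the activation arrays, the dimension `d_model`, the steering strength `alpha`, and all function names are hypothetical stand-ins for hidden states that would be collected from a real model layer.

```python
import numpy as np

def mean_difference_direction(faithful, unfaithful):
    """Unit vector pointing from the unfaithful mean to the faithful mean."""
    diff = faithful.mean(axis=0) - unfaithful.mean(axis=0)
    return diff / np.linalg.norm(diff)

def probe_scores(acts, direction):
    """Linear probe: project each activation vector onto the direction."""
    return acts @ direction

def steer(acts, direction, alpha):
    """Linear steering: shift every activation by alpha along the direction."""
    return acts + alpha * direction

# Synthetic stand-in for layer activations (assumption: real hidden states
# would be extracted from a middle-to-late layer of the model under study).
rng = np.random.default_rng(0)
d_model = 64
true_dir = rng.normal(size=d_model)
true_dir /= np.linalg.norm(true_dir)
faithful = rng.normal(size=(200, d_model)) + 3.0 * true_dir
unfaithful = rng.normal(size=(200, d_model)) - 3.0 * true_dir

# Fit the direction and place the decision threshold midway between classes.
direction = mean_difference_direction(faithful, unfaithful)
f_scores = probe_scores(faithful, direction)
u_scores = probe_scores(unfaithful, direction)
threshold = 0.5 * (f_scores.mean() + u_scores.mean())

# Balanced probing accuracy on the same synthetic data.
accuracy = 0.5 * ((f_scores > threshold).mean() + (u_scores <= threshold).mean())

# Push the unfaithful activations toward the faithful side of the probe.
steered = steer(unfaithful, direction, alpha=6.0)
steered_scores = probe_scores(steered, direction)
```

Because `direction` has unit norm, steering with strength `alpha` raises each probe score by exactly `alpha`, which is why a single scalar controls how far activations move along the probed direction in this toy setting.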
