Characterizing Backtracking in CoT through Internal Probes and Surface-Level Features
Abstract
Chain-of-thought (CoT) traces from reasoning models often include revisions of intermediate reasoning steps, a behavior we term backtracking. We explore when and why backtracking occurs during reasoning. Using an automated annotation pipeline, we find that backtracking is rare (3-10% of reasoning chunks) and highly autocorrelated. We further compare surface-level predictors with linear probes on hidden states to identify features predictive of backtracking. While surface features provide substantial signal (ROC-AUC up to 0.80), hidden-state probes prove superior both for detecting current backtracking and for predicting its onset in the next step (true positive rate at a 5% false positive rate, TPR@5%FPR, up to 0.47). Our results indicate that backtracking reflects a structured internal regime during generation rather than merely superficial linguistic cues.
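For readers unfamiliar with the TPR@5%FPR metric quoted above, the following is a minimal sketch of one common way to compute it: sweep the decision threshold from the highest score downward and keep the best true positive rate reached while the false positive rate stays at or below the budget. This is an illustration under stated assumptions, not the evaluation code used in this work; the function name `tpr_at_fpr` and the toy labels and scores are hypothetical.

```python
def tpr_at_fpr(labels, scores, max_fpr=0.05):
    """Best true positive rate achievable with false positive rate <= max_fpr.

    labels: iterable of 0/1 (1 = backtracking chunk, by assumption)
    scores: iterable of real-valued detector scores, higher = more likely positive
    """
    labels = list(labels)
    scores = list(scores)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")

    # Visit examples from highest to lowest score; each distinct score is a
    # candidate threshold ("predict positive if score >= threshold").
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    best_tpr = 0.0
    i = 0
    while i < len(ranked):
        # Consume all examples tied at this score together, so the counts
        # always correspond to a realizable threshold.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        if fp / n_neg > max_fpr:
            break  # FPR only grows as the threshold drops, so stop here
        best_tpr = max(best_tpr, tp / n_pos)
        i = j
    return best_tpr


# Synthetic toy data (hypothetical): 4 positives and 20 negatives.
toy_labels = [1, 0, 1, 0, 1, 1] + [0] * 18
toy_scores = [0.9, 0.85, 0.8, 0.5, 0.4, 0.3] + [0.1] * 18

print(tpr_at_fpr(toy_labels, toy_scores, max_fpr=0.05))  # 0.5
```

With 20 negatives, a 5% budget permits exactly one false positive, so the threshold can drop to 0.8 and recover two of the four positives. Unlike ROC-AUC, which averages over all operating points, this metric focuses on the low-false-alarm regime, which matters when the positive class is rare.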