Think Less, Code Better: Probing When Chain-of-Thought Hurts and How to Route Around It
Rajarshi Ghoshal ⋅ Salma Abdelhalim ⋅ Debadri Basak ⋅ Pratibha Arora
Abstract
Chain-of-Thought (CoT) prompting is the dominant strategy for eliciting step-by-step reasoning in LLMs. We present a controlled study of when this reasoning augmentation helps and when it hurts code generation, covering three architectures: a 2×2 design with Qwen2.5-Coder-1.5B and DeepSeek-Coder-1.3B (each in base and instruction-tuned variants) on HumanEval, MBPP, and LiveCodeBench, plus a preliminary evaluation of CodeLlama-7B. Our key finding is that instruction tuning reverses CoT's effect on the same base architecture: CoT significantly improves Qwen base ($+13.4\%$, $p<0.001$) but significantly degrades Qwen instruct ($-15.2\%$, $p<0.001$). DeepSeek remains insensitive to CoT regardless of training regime (all $p>0.2$), indicating that sensitivity is architecture-specific. Layer-wise probing shows that all four models encode prompt type by layers 1–4 (>90% probe accuracy), yet this universal early encoding leads to divergent downstream behavior. Representation does not determine interpretation; training regime does. Building on this, we develop a probe-guided style router that selects one of 12 prompt styles per problem via a single forward pass (84 ms overhead). The router is statistically indistinguishable from the best fixed style in 7/8 settings (McNemar's test, all $p_{\text{Best}}>0.1$) and significantly outperforms CoT where CoT is most harmful ($p=0.012$, $h=+0.40$). CoT prompting should not be applied blindly: its effect is mechanistically detectable from early-layer activations.
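For readers unfamiliar with the paired comparison quoted in the abstract, the following is a minimal sketch of how a McNemar p-value and an effect size $h$ can be computed from per-problem pass/fail outcomes of two prompting strategies. It assumes that $h$ denotes Cohen's $h$ for two proportions and that an exact (binomial) McNemar test is used; the abstract states neither. The function names and the toy outcome lists are hypothetical and are not data from the paper.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts.

    b: problems solved only by strategy A
    c: problems solved only by strategy B
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def compare(a_pass: list[bool], b_pass: list[bool]) -> tuple[float, float]:
    """Return (p_value, h) for paired per-problem outcomes of A vs. B."""
    if len(a_pass) != len(b_pass):
        raise ValueError("outcome lists must be paired per problem")
    only_a = sum(x and not y for x, y in zip(a_pass, b_pass))
    only_b = sum(y and not x for x, y in zip(a_pass, b_pass))
    rate_a = sum(a_pass) / len(a_pass)
    rate_b = sum(b_pass) / len(b_pass)
    return mcnemar_exact(only_a, only_b), cohens_h(rate_a, rate_b)


# Toy illustration only (made-up outcomes, not results from the paper).
router = [True, True, True, False, True, True, False, True]
cot = [True, False, True, False, False, True, False, False]
p_value, h = compare(router, cot)
print(f"p = {p_value:.3f}, h = {h:+.2f}")
```

A positive $h$ here means the first strategy has the higher pass rate. Only the discordant problems (solved by exactly one strategy) enter the McNemar p-value, which is why the test suits paired per-problem comparisons on a shared benchmark.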