Chain-of-Thought Injection as an Inference-Time Safety Intervention
Abstract
As large language models are increasingly deployed in agentic settings, ensuring that their reasoning remains aligned with intended goals becomes a critical challenge for AI control. While established control techniques such as monitoring and sandboxing primarily operate on model inputs and outputs, the extended reasoning traces of modern LLMs represent an underexplored intervention surface. In this work, we propose Chain-of-Thought injection as a lightweight, inferencetime control mechanism, in which a trusted overseer inserts brief safety reminders into the reasoning trace of an untrusted model to redirect misaligned reasoning before it compounds. We evaluate this approach on the MMLU and MATH500 benchmarks across two models — DeepSeek-R1-Distilled-Qwen-14B and Qwen3-235B — using prompt-based attacks that induce systematically incorrect outputs as a proxy for misaligned behaviour. Our results suggest that injection position is an important determinant of effectiveness: interventions applied early in the reasoning trace consistently produce stronger safety outcomes, suggesting that misaligned reasoning becomes progressively harder to disrupt once established. We further show that injection content influences effectiveness, and find preliminary evidence that injections which the model finds less surprising tend to produce better outcomes. Together, these findings provide initial evidence that strategically timed, lightweight interventions in the reasoning trace may offer a useful complement to existing AI control approaches.