Do Large Language Models Encode Instruction Overwrites?
Abstract
Large language models (LLMs) are vulnerable to prompt injection attacks that overwrite an original instruction, yet the internal mechanisms underlying this behavior remain unclear. Standard defenses rely on training-based safety alignment to enforce instruction hierarchies and offer little insight into how overwriting arises inside the model. We present a representation-level analysis of instruction overwriting in LLMs. Using a payload-augmented benchmark and layer-wise linear probing, we show that LLMs encode a distinct internal signal that distinguishes execution of the original task from execution of the injected instruction. This signal generalizes across models, injection templates, payloads, and established benchmarks such as StruQ and representative subsets of IH-Bench. Building on this finding, we derive a one-dimensional activation steering vector that increases adherence to the original instruction under prompt injection, significantly reducing successful overwrites while preserving general instruction-following ability, all without retraining. More generally, our results demonstrate that safety-relevant behaviors can be controlled through low-dimensional representation engineering.
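To make the two ingredients above concrete (a linear probe on layer activations and a one-dimensional steering direction), the following is a minimal sketch on synthetic data. Everything in it is an assumption for illustration: the activations are random stand-ins rather than real hidden states, the direction is a simple difference of class means, and the names `steer`, `apply_steering`, and `alpha` are hypothetical. The abstract does not specify the probe, the layers, or the steering procedure actually used.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64   # hypothetical hidden size
n = 200  # hypothetical number of examples per class

# Synthetic stand-ins for hidden states at one layer. In practice these
# would come from forward passes over benchmark prompts (assumption).
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
followed_original = rng.normal(size=(n, d)) + 2.0 * true_dir
followed_injected = rng.normal(size=(n, d)) - 2.0 * true_dir

# One-dimensional direction: difference of class means, unit-normalised.
steer = followed_original.mean(axis=0) - followed_injected.mean(axis=0)
steer /= np.linalg.norm(steer)

# Minimal linear probe: project onto the direction and threshold at the
# midpoint between the two class means.
X = np.vstack([followed_original, followed_injected])
y = np.concatenate([np.ones(n), np.zeros(n)])
class_mean_avg = 0.5 * (followed_original.mean(axis=0)
                        + followed_injected.mean(axis=0))
midpoint = class_mean_avg @ steer
probe_acc = float((((X @ steer) > midpoint) == y).mean())


def apply_steering(hidden, direction, alpha):
    """Add alpha * direction to every hidden state (each row of `hidden`)."""
    return hidden + alpha * direction


# Push "injected" activations toward the "original task" side of the probe.
steered = apply_steering(followed_injected, steer, alpha=4.0)
steered_frac = float(((steered @ steer) > midpoint).mean())
```

In a real model, the addition in `apply_steering` would typically be applied to the residual stream at a chosen layer during generation. The layer and the scale `alpha` would need to be tuned against both overwrite rate and general instruction-following quality.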