CommandSans: Securing AI Agents with Surgical Precision Prompt Sanitization
Abstract
The increasing adoption of LLM agents with access to numerous tools and sensitive data significantly widens the attack surface for indirect prompt injections. Current defenses struggle with context-dependent attacks and cannot reliably differentiate malicious from benign instructions, leading to high false positive rates that prevent real-world adoption. To address this, we present a novel approach inspired by a fundamental principle of computer security: data should not contain executable instructions. Instead of classifying tool outputs as malicious or benign—a hard, low-resource, and ill-defined objective—we reframe the sanitization task entirely. We propose a token-level sanitization process that surgically removes any instructions directed at AI systems from untrusted tool outputs, capturing malicious instructions as a byproduct without needing to judge maliciousness. While this over-approximation may appear restrictive, our experiments show that it does not harm agent utility. In contrast to existing safety classifiers, this approach is non-blocking, requires no calibration, and is context-agnostic. Further, we train such token-level predictors using only readily available instruction-tuning data, without relying on unrealistic prompt injection examples. Across a wide range of attacks and benchmarks, including AgentDojo, BIPIA, InjecAgent, ASB, and SEP, our method achieves a 7–19× reduction in attack success rate (from 34% to 3% on AgentDojo) without impairing agent utility in either benign or malicious settings.