Poster
in
Workshop: Agents in the Wild: Safety, Security, and Beyond

CommandSans: Securing AI Agents with Surgical Precision Prompt Sanitization

Debeshee Das ⋅ Luca Beurer-Kellner ⋅ Marc Fischer ⋅ Maximilian Baader

Project Page [ Poster] [ OpenReview]

Abstract

The increasing adoption of LLM agents with access to numerous tools and sensitive data significantly widens the attack surface for indirect prompt injections. Current defenses struggle with context-dependent attacks and cannot reliably differentiate malicious from benign instructions, leading to high false positive rates that prevent real-world adoption. To address this, we present a novel approach inspired by a fundamental principle of computer security: data should not contain executable instructions. Instead of classifying tool outputs as malicious or benign—a hard, low-resource, and ill-defined objective—we reframe the sanitization task entirely. We propose a token-level sanitization process that surgically removes any instructions directed at AI systems from untrusted tool outputs, capturing malicious instructions as a byproduct without needing to judge maliciousness. While this over-approximation may appear restrictive, our experiments show that it does not harm agent utility. In contrast to existing safety classifiers, this approach is non-blocking, requires no calibration, and is context-agnostic. Further, we train such token-level predictors using only readily available instruction-tuning data, without relying on unrealistic prompt injection examples. Across a wide range of attacks and benchmarks, including AgentDojo, BIPIA, InjecAgent, ASB, and SEP, our method achieves a 7–19× reduction in attack success rate (from 34% to 3% on AgentDojo) without impairing agent utility in either benign or malicious settings.

Chat is not available.