A Framework for Formalizing LLM Agent Security
Abstract
Existing definitions of agent security fail to capture the contextual nature of authorization: the same action can represent legitimate behavior or a security violation depending on who commanded it, what objective is pursued, and whether the action serves that objective. Current approaches treat actions as inherently malicious based on content patterns, forcing defenses into utility-security tradeoffs where blocking suspicious patterns prevents legitimate use while permissive filtering enables attacks. We present a systematization of knowledge decomposing agent security into four properties: task alignment (pursuing authorized objectives), action alignment (individual actions serving those objectives), authorized instruction following (commands from authenticated sources), and data isolation (information flows respecting privilege boundaries). This systematization introduces oracle functions that formalize security verification, revealing why agent security requires causal attribution and provenance tracking capabilities beyond traditional security systems. We reformalize existing attacks (indirect prompt injection, task drift, capability misuse, memory poisoning) through property-level analysis, showing that superficially similar behaviors may violate different properties and require distinct defenses. We analyze existing defenses, revealing fundamental limitations: pattern matching cannot distinguish context-dependent authorization, defenses targeting single properties remain vulnerable to attacks violating multiple properties, and current benchmarks miss temporal violations by resetting context between evaluations. Our systematization provides vocabulary for precise vulnerability classification, identifies coverage gaps in existing defenses, and reveals open challenges including oracle function approximation and compositional security verification.