GhostEI-Bench: Are Mobile Agents Resilient to Environmental Injection in Dynamic On-Device Environments?
Abstract
Vision-Language Models (VLMs) are increasingly deployed as autonomous agents that navigate mobile Graphical User Interfaces (GUIs). However, their operation within dynamic on-device ecosystems, which include notifications, pop-ups, and inter-app interactions, exposes them to a unique and underexplored threat vector: environmental injection. Unlike traditional prompt-based attacks that manipulate textual instructions, environmental injection contaminates the agent's visual perception by inserting adversarial UI elements, such as deceptive overlays or spoofed notifications, directly into the GUI. These attacks bypass textual safeguards and can derail agent execution, leading to privacy leakage, financial loss, or irreversible device compromise. To systematically evaluate this threat, we introduce GhostEI-Bench, the first benchmark dedicated to assessing mobile agents under environmental injection attacks within dynamic, executable environments. Moving beyond static image-based assessments, our benchmark injects adversarial events into realistic application workflows inside fully operational Android emulators and evaluates agent performance across a range of critical risk scenarios. We further introduce a novel evaluation protocol in which a judge LLM performs fine-grained failure analysis by reviewing the agent's action trajectory alongside the corresponding sequence of screenshots, pinpointing the precise point of failure, whether in perception, recognition, or reasoning. Our comprehensive evaluation of state-of-the-art agents reveals their profound vulnerability to deceptive environmental cues: current models systematically fail to perceive and reason about manipulated UIs. GhostEI-Bench provides an essential framework for quantifying and mitigating this emerging threat, paving the way for the development of more robust and secure embodied agents.