SenseAct: Structuring GUI Actions for Reliable Planning and Verification
Abstract
Reliable interaction with graphical user interfaces (GUIs) requires agents to make irreversible control commitments under partial observability, yet most existing GUI agents reduce this problem to step-by-step, perception-conditioned action prediction, leaving decisions implicit and execution unverifiable. We introduce SenseAct, a new paradigm for GUI agents that lifts low-level GUI observations (e.g., XML/DOM trees) into explicit control commitments specifying both the controllable interface elements and their expected state transitions. In SenseAct, decision making is formulated over typed control primitives whose dependencies and ordering are explicitly represented, and whose execution is governed by programmatic post-condition predicates that define precise, observable success criteria. This formulation grounds task progress in symbolic state transitions, turning execution verification into a well-defined, deterministic decision rather than a language-based self-judgment. Because of these explicit execution semantics, SenseAct needs no pixel-level visual reasoning during normal execution and invokes a vision-language model (VLM) only when symbolic state transitions violate the expected control effects. On the challenging DroidTask and AndroidLab benchmarks, SenseAct reduces VLM calls by 14.49% and 30.55%, respectively, while achieving higher success rates, demonstrating that internalizing programmatic verification within a closed-loop control process effectively eliminates execution drift and enhances the efficiency of GUI representations.
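To make the control loop described above concrete, the following is a minimal illustrative sketch, not the SenseAct implementation. It assumes a UI tree exposed as XML with `resource-id` and `checked` attributes; the names `ControlPrimitive`, `attr_equals`, `run_plan`, and `FakeDevice` are hypothetical and introduced only for illustration. Each primitive carries a programmatic post-condition, and the fallback (standing in for a VLM call) is invoked only when the observed tree violates that post-condition.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable, List

# A post-condition is a deterministic check on the observed UI tree.
Predicate = Callable[[ET.Element], bool]


@dataclass
class ControlPrimitive:
    """Hypothetical typed control primitive with an explicit success criterion."""
    kind: str                  # e.g. "tap"
    target_id: str             # resource id of the controllable element
    post_condition: Predicate  # must hold on the UI tree after execution


def attr_equals(target_id: str, attr: str, expected: str) -> Predicate:
    """Build a predicate: the element with this id has attr == expected."""
    def check(tree: ET.Element) -> bool:
        for node in tree.iter():
            if node.get("resource-id") == target_id:
                return node.get(attr) == expected
        return False  # element not found: treat as a violated post-condition
    return check


def run_plan(plan: List[ControlPrimitive],
             execute: Callable[[ControlPrimitive], None],
             observe: Callable[[], ET.Element],
             fallback: Callable[[ControlPrimitive], None]) -> int:
    """Execute primitives in order and return the number of fallback calls.

    The fallback (a stand-in for a VLM call) runs only when the symbolic
    state observed after a step violates that step's post-condition.
    """
    fallback_calls = 0
    for step in plan:
        execute(step)
        if not step.post_condition(observe()):
            fallback_calls += 1
            fallback(step)
    return fallback_calls


class FakeDevice:
    """Simulated device (assumption for the demo): taps on 'bt_switch' fail."""

    def __init__(self) -> None:
        self.tree = ET.fromstring(
            '<hierarchy>'
            '<node resource-id="wifi_switch" checked="false"/>'
            '<node resource-id="bt_switch" checked="false"/>'
            '</hierarchy>'
        )
        self.unresponsive = {"bt_switch"}

    def execute(self, step: ControlPrimitive) -> None:
        if step.target_id in self.unresponsive:
            return  # the tap has no effect on the UI state
        for node in self.tree.iter("node"):
            if node.get("resource-id") == step.target_id:
                node.set("checked", "true")

    def observe(self) -> ET.Element:
        return self.tree


if __name__ == "__main__":
    device = FakeDevice()
    needs_recovery: List[ControlPrimitive] = []
    plan = [
        ControlPrimitive("tap", "wifi_switch",
                         attr_equals("wifi_switch", "checked", "true")),
        ControlPrimitive("tap", "bt_switch",
                         attr_equals("bt_switch", "checked", "true")),
    ]
    calls = run_plan(plan, device.execute, device.observe, needs_recovery.append)
    print(calls, [s.target_id for s in needs_recovery])
```

In this sketch the first step passes its check symbolically and triggers no fallback, while the second step's violated post-condition is the only event that would reach the VLM.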