ScenDroid: A Scenario-Level Benchmark for Long-Horizon, Time-Evolving GUI Agents
Abstract
Recent advances in Vision Language Models (VLMs) have enabled GUI agents to perform discrete, short-horizon tasks. However, prevailing benchmarks rely predominantly on an Atomic Reset paradigm, which treats user interactions as isolated episodes and therefore fails to capture the continuous, state-dependent nature of real-world workflows. To bridge this gap, we introduce ScenDroid, which employs a novel App--Task--Scenario (ATS) architecture to orchestrate dependency-aware workflows (exceeding 1200 steps) across persistent Android environments spanning simulated days to weeks. Beyond operational execution, ScenDroid incorporates a Progressive Ambiguity Taxonomy and an integrated Interactive User Simulator to assess an agent’s capacity for proactive intent clarification and long-term preference alignment. Our extensive evaluation of state-of-the-art GUI agents reveals a catastrophic performance collapse, exposing critical cognitive bottlenecks in structured episodic memory and closed-loop reasoning. We further analyze these failure modes and provide a roadmap for developing ``Digital Agents'' capable of persistent, autonomous, and human-aligned interaction. We release all scenarios, environment snapshots, agents, and evaluation data to catalyze research into these long-term challenges.