Poster in Workshop: Lifelong Agents: Learning, Aligning, Evolving
Sun, Apr 26, 2026 • 11:00 AM – 12:00 PM PDT

ScenDroid: A Scenario-Level Benchmark for Long-Horizon, Time-Evolving GUI Agents

Zhe Wu ⋅ Yongxin Kang ⋅ Dabin Sheng ⋅ Junliang Xing ⋅ Guokun Wu ⋅ Derek Yuen ⋅ Donglin Mo ⋅ Yuheng Jing ⋅ Kai Li ⋅ Weilin Luo ⋅ Kun Shao ⋅ Yuanchun Shi


Abstract

Recent breakthroughs in Vision-Language Models (VLMs) have enabled GUI agents to perform discrete, short-horizon tasks. However, prevailing benchmarks rely predominantly on an Atomic Reset paradigm that treats user interactions as isolated episodes and therefore fails to capture the continuous, state-dependent nature of real-world workflows. To bridge this gap, we introduce ScenDroid, which uses a novel App-Task-Scenario (ATS) architecture to orchestrate dependency-aware workflows (exceeding 1200 steps) across persistent Android environments spanning simulated days to weeks. Beyond operational execution, ScenDroid incorporates a Progressive Ambiguity Taxonomy and an integrated Interactive User Simulator to assess an agent's capacity for proactive intent clarification and long-term preference alignment. Our extensive evaluation of state-of-the-art GUI agents reveals a catastrophic performance collapse, exposing critical cognitive bottlenecks in structured episodic memory and closed-loop reasoning. We further analyze these failure modes and provide a roadmap for developing "Digital Agents" capable of persistent, autonomous, and human-aligned interaction. We release all scenarios, environment snapshots, agents, and evaluation data to catalyze research into these long-term challenges.
