Poster in Workshop: Lifelong Agents: Learning, Aligning, Evolving
Sun, Apr 26, 2026 • 11:00 AM – 12:00 PM PDT

CAP: A Scalable Benchmark for Evaluating Cross-Site Browser Agents with Complex Actions and Perception

XuZejun ⋅ Taiyi Chen ⋅ Jin Li ⋅ Yongtong Gu ⋅ QiCheng ⋅ Lvaixuan ⋅ Zhu shuai ⋅ ZhuPengfei ⋅ Kaichen Yang ⋅ Sun Boyu ⋅ YixianYang ⋅ Mulong Xie ⋅ Xiaoteng Ma ⋅ Hongru WANG

Abstract

Large language models are increasingly deployed as autonomous agents that interact with the web through browsers. While recent progress has been driven by benchmarks that evaluate end-to-end task success, these evaluations largely overlook two fundamental sources of difficulty in real web browsing: complex actions over rich user interfaces and visual perception of dynamically rendered content, especially in workflows that span multiple websites. We introduce CAP, a scalable benchmark for evaluating browser agents on cross-site, human-like web tasks that require non-trivial UI interactions and visual understanding. Specifically, we adopt a decomposition-recomposition pipeline that first abstracts each website into a structured site card, capturing its user-facing functions, complex execution operations, and perceptual requirements, and then recomposes these components into realistic cross-site workflows. Each task is therefore grounded in specific operations on each website, enabling fine-grained diagnosis. Building on this framework, we construct 420 tasks across 108 real-world websites and 24 domains, with careful quality control. Experiments on state-of-the-art browser agents, evaluated with our newly proposed verifiable agent-as-a-judge framework, show low success rates and reveal that perception-heavy interactions remain a major bottleneck, highlighting substantial gaps between current agents and real-world web browsing demands.
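
The decomposition-recomposition idea described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `SiteCard` and `Step` field names, the `recompose` and `diagnose` helpers, and the example sites are hypothetical and are not taken from the CAP paper or its code. The naive pairing below only shows how grounding each step in one site's operations makes per-step diagnosis possible; the actual pipeline is presumably far more selective.

```python
from __future__ import annotations

from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class SiteCard:
    """Hypothetical site card; the field names are assumptions, not the paper's schema."""
    site: str
    domain: str
    functions: tuple[str, ...]    # user-facing functions
    operations: tuple[str, ...]   # complex UI operations
    perception: tuple[str, ...]   # perceptual requirements


@dataclass(frozen=True)
class Step:
    """One step of a workflow, tied to a single site and its operations."""
    site: str
    function: str
    operations: tuple[str, ...]
    perception: tuple[str, ...]


def recompose(cards: list[SiteCard], sites_per_task: int = 2) -> list[tuple[Step, ...]]:
    """Naively combine cards from distinct sites into cross-site workflows."""
    usable = [c for c in cards if c.functions]  # skip cards with no functions
    tasks = []
    for group in combinations(usable, sites_per_task):
        if len({c.site for c in group}) < sites_per_task:
            continue  # a cross-site task needs distinct sites
        tasks.append(tuple(
            Step(c.site, c.functions[0], c.operations, c.perception) for c in group
        ))
    return tasks


def diagnose(task: tuple[Step, ...], failed_step_index: int) -> dict:
    """Map a failed step back to the site and the capabilities it required."""
    step = task[failed_step_index]
    return {"site": step.site, "operations": step.operations, "perception": step.perception}


# Made-up example cards for two placeholder sites.
cards = [
    SiteCard("maps.example", "travel", ("find route",), ("drag map",), ("read map tiles",)),
    SiteCard("shop.example", "shopping", ("filter products",), ("adjust price slider",),
             ("compare product images",)),
]
tasks = recompose(cards)
report = diagnose(tasks[0], failed_step_index=1)
```

In this sketch, a failure on the second step is attributed to `shop.example` together with the slider operation and the image-comparison requirement, which is the kind of fine-grained attribution the abstract refers to.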
