CAP: A Scalable Benchmark for Evaluating Cross-Site Browser Agents with Complex Actions and Perception
Abstract
Large language models are increasingly deployed as autonomous agents that interact with the web through browsers. While recent progress has been driven by benchmarks that evaluate end-to-end task success, these evaluations largely overlook two fundamental sources of difficulty in real web browsing: complex actions over rich user interfaces and visual perception of dynamically rendered content, especially in workflows that span multiple websites. We introduce CAP, a scalable benchmark for evaluating browser agents on cross-site, human-like web tasks that require non-trivial UI interactions and visual understanding. Specifically, we adopt a decomposition-recomposition pipeline that first abstracts each website into a structured site card, capturing its user-facing functions, complex execution operations, and perceptual requirements, and then recomposes these components into realistic cross-site workflows. As a result, each task is grounded in specific operations on each website, enabling fine-grained diagnosis. Building on this framework, we construct 420 tasks across 108 real-world websites and 24 domains, with careful quality control. Experiments on state-of-the-art browser agents, evaluated with our newly proposed verifiable agent-as-a-judge framework, show low success rates and reveal that perception-heavy interactions remain a major bottleneck, highlighting substantial gaps between current agents and real-world web browsing demands.
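To make the decomposition-recomposition idea concrete, the following is a minimal illustrative sketch, not the CAP implementation. The names `SiteCard`, `Step`, `recompose`, and `failures_by_operation`, the example sites, and the naive pairing strategy are all assumptions introduced here for illustration. The sketch only shows how a site card with three kinds of fields could be recombined into cross-site tasks whose steps stay tied to a specific operation, so that failures can be attributed per operation.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import combinations
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class SiteCard:
    """Hypothetical site card: the three kinds of fields named in the abstract."""
    site: str
    functions: Tuple[str, ...]                # user-facing functions
    complex_operations: Tuple[str, ...]       # non-trivial UI interactions
    perceptual_requirements: Tuple[str, ...]  # what must be visually understood


@dataclass(frozen=True)
class Step:
    """One step of a cross-site task, grounded in a single site operation."""
    site: str
    function: str
    operation: str
    perception: str


def recompose(cards: Sequence[SiteCard], sites_per_task: int = 2) -> List[List[Step]]:
    """Naively pair site cards into cross-site tasks (assumed strategy).

    Each task takes the first function, operation, and perceptual
    requirement of every card in the group.
    """
    tasks = []
    for group in combinations(cards, sites_per_task):
        tasks.append([
            Step(c.site, c.functions[0], c.complex_operations[0],
                 c.perceptual_requirements[0])
            for c in group
        ])
    return tasks


def failures_by_operation(task: Sequence[Step], step_passed: Sequence[bool]) -> Counter:
    """Count failed steps per operation, enabling per-operation diagnosis."""
    counts = Counter()
    for step, ok in zip(task, step_passed):
        if not ok:
            counts[step.operation] += 1
    return counts


# Made-up example cards; the sites and field values are placeholders.
cards = [
    SiteCard("maps.example", ("find route",), ("drag map",), ("read rendered map",)),
    SiteCard("shop.example", ("filter products",), ("drag price slider",),
             ("compare product images",)),
    SiteCard("calendar.example", ("create event",), ("pick date in widget",),
             ("read calendar grid",)),
]
tasks = recompose(cards)
```

Because every step records its site and operation, a judge's per-step verdicts can be aggregated by operation type rather than only by end-to-end success.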