

Poster
in
Workshop: Workshop on Reasoning and Planning for Large Language Models

ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use

Kaixin Li · Meng ziyang · Hongzhan Lin · Ziyang Luo · Yuchen Tian · Jing Ma · Zhiyong Huang · Tat-Seng Chua


Abstract:

Multi-modal large language models (MLLMs) are rapidly advancing in visual understanding and reasoning, enhancing GUI agents for tasks such as web browsing and mobile interaction. However, while these agents apply reasoning skills to action planning, they rely solely on the model's raw capability for UI grounding (localizing the target element). Such grounding models struggle with high-resolution displays, small targets, and complex environments. In this work, we introduce a novel method that improves MLLMs' grounding performance in high-resolution, complex UI environments through a visual-search approach driven by visual reasoning. Additionally, we create a new benchmark, dubbed ScreenSpot-Pro, designed to comprehensively evaluate model capabilities in professional high-resolution settings. The benchmark consists of real-world high-resolution screenshots with expert-annotated tasks from diverse professional domains. Our experiments show that existing GUI grounding models perform poorly on this dataset, with the best achieving only 18.9%, whereas our visual-reasoning strategy significantly improves performance, reaching 48.1% without any additional training.
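The abstract does not spell out how the visual search is implemented. Purely as an illustration of the coarse-to-fine idea (zooming into progressively smaller candidate regions of a large screenshot), the sketch below uses a hypothetical `score_regions` stub in place of an MLLM scorer; none of the names or details here are taken from the authors' method:

```python
import numpy as np

def score_regions(screen, regions):
    """Hypothetical stand-in for an MLLM judging how likely each
    region contains the target; here simply mean pixel intensity."""
    return [screen[t:b, l:r].mean() for (t, l, b, r) in regions]

def visual_search(screen, min_size=64):
    """Coarse-to-fine search: split the current region into a 2x2
    grid, keep the best-scoring cell, and zoom in until the region
    is small enough to ground the target directly."""
    top, left = 0, 0
    bottom, right = screen.shape[0], screen.shape[1]
    while (bottom - top) > min_size and (right - left) > min_size:
        mid_r, mid_c = (top + bottom) // 2, (left + right) // 2
        cells = [(top, left, mid_r, mid_c), (top, mid_c, mid_r, right),
                 (mid_r, left, bottom, mid_c), (mid_r, mid_c, bottom, right)]
        best = int(np.argmax(score_regions(screen, cells)))
        top, left, bottom, right = cells[best]
    return top, left, bottom, right

# Synthetic 4K-resolution "screenshot" with one small bright target.
screen = np.zeros((2160, 3840), dtype=np.float32)
screen[1500:1516, 3000:3016] = 1.0  # a 16x16 "icon"
box = visual_search(screen)
print(box)  # a small region around the icon
```

A real system would crop the selected region at full resolution and re-query the model at each step, so small targets that are invisible in a downscaled 4K frame stay legible to the grounder.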
