

Poster

Grounding Multimodal Large Language Model in GUI World

Weixian Lei · Difei Gao · Mike Zheng Shou

Fri 25 Apr 7 p.m. PDT — 9:30 p.m. PDT

Abstract:

Recent advancements in Multimodal Large Language Models (MLLMs) have accelerated the development of Graphical User Interface (GUI) agents capable of automating complex tasks across digital platforms. However, precise grounding of GUI elements remains a key challenge for accurate interaction and generalization. In this work, we present an effective GUI grounding framework, which includes an automated data collection engine that gathers extensive GUI screenshots and annotations to ensure broad generalization. We also propose a lightweight and flexible GUI grounding module designed to efficiently localize UI elements by pre-training on the collected data, and introduce a novel method to integrate this module with MLLMs for the effective execution of GUI tasks. Our approach demonstrates superior task accuracy and adaptability, as validated on benchmarks including ScreenSpot, MiniWoB, AITW, and Mind2Web.
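
To make the architecture concrete, below is a minimal sketch of how such a lightweight grounding module could be attached to a frozen MLLM: the MLLM emits a textual description of the target element, and a small head scores screenshot patch features against that description and regresses a bounding box. This is an illustrative assumption, not the paper's implementation; the class name GroundingModule, the dimensions, and the attention-pooling design are all hypothetical.

```python
import torch
import torch.nn as nn

class GroundingModule(nn.Module):
    """Hypothetical lightweight grounding head: scores screenshot patch
    features against a text-query embedding, then regresses a box."""
    def __init__(self, vision_dim: int = 768, text_dim: int = 768, hidden: int = 256):
        super().__init__()
        self.vision_proj = nn.Linear(vision_dim, hidden)
        self.text_proj = nn.Linear(text_dim, hidden)
        self.box_head = nn.Linear(hidden, 4)  # (x, y, w, h), normalized to [0, 1]

    def forward(self, patch_feats: torch.Tensor, query_feat: torch.Tensor) -> torch.Tensor:
        # patch_feats: (B, N, vision_dim) visual features of the screenshot
        # query_feat:  (B, text_dim) embedding of the MLLM's element description
        v = self.vision_proj(patch_feats)               # (B, N, hidden)
        q = self.text_proj(query_feat).unsqueeze(1)     # (B, 1, hidden)
        attn = torch.softmax((v * q).sum(-1), dim=-1)   # per-patch relevance to the query
        pooled = (attn.unsqueeze(-1) * v).sum(1)        # (B, hidden) query-weighted feature
        return self.box_head(pooled).sigmoid()          # normalized bounding box

# Usage sketch: the MLLM names the target element (e.g. "Search button"),
# the grounding module returns its normalized screen coordinates.
module = GroundingModule()
patches = torch.randn(1, 196, 768)  # e.g. ViT patch features of one screenshot
query = torch.randn(1, 768)         # text embedding of the element description
box = module(patches, query)        # tensor of shape (1, 4), values in [0, 1]
```

Keeping the head this small is what lets it be pre-trained cheaply on large-scale collected screenshots and then bolted onto different MLLM backbones without retraining the backbone itself.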
