Poster
VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents
Xiao Liu · Tianjie Zhang · Yu Gu · Iat Long Iong · Song XiXuan · Yifan Xu · Shudan Zhang · Hanyu Lai · Jiadai Sun · Xinyue Yang · Yu Yang · Zehan Qi · Shuntian Yao · Xueqiao Sun · Siyi Cheng · Qinkai Zheng · Hao Yu · Hanchen Zhang · Wenyi Hong · Ming Ding · Lihang Pan · Xiaotao Gu · Aohan Zeng · Zhengxiao Du · Chan Hee Song · Yu Su · Yuxiao Dong · Jie Tang
Hall 3 + Hall 2B #212
Large Multimodal Models (LMMs) have ushered in a new era in artificial intelligence, merging capabilities in both language and vision to form highly capable Visual Foundation Agents that are postulated to excel across a myriad of tasks. However, existing benchmarks fail to sufficiently challenge or showcase the full potential of LMMs as visual foundation agents in complex, real-world environments. To address this gap, we introduce VisualAgentBench (VAB), a comprehensive and unified benchmark specifically designed to train and evaluate LMMs as visual foundation agents across diverse scenarios in one standard setting, including Embodied, Graphical User Interface, and Visual Design, with tasks formulated to probe the depth of LMMs' understanding and interaction capabilities. Through rigorous testing across 9 proprietary LMM APIs and 9 open models (18 in total), we demonstrate the considerable yet still developing visual agent capabilities of these models. Additionally, VAB explores synthesizing visual agent trajectory data through hybrid methods, including Program-based Solvers, LMM Agent Bootstrapping, and Human Demonstrations, offering insights into the obstacles, solutions, and trade-offs one may encounter in developing open LMM agents. Our work not only benchmarks existing models but also provides an instrumental playground for future development of visual foundation agents. Code, training data, and test data are available at https://github.com/THUDM/VisualAgentBench.