Poster in Workshop: Workshop on Large Language Models for Agents
Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study
Weihao Tan · Ziluo Ding · Wentao Zhang · Boyu Li · Bohan Zhou · Junpeng Yue · Haochong Xia · Jiechuan Jiang · Longtao Zheng · Xinrun Xu · Yifei Bi · Pengjie Gu · Xinrun Wang · Börje Karlsson · Bo An · Zongqing Lu
Foundation agents are a promising path toward Artificial General Intelligence. Recent studies have demonstrated their success in specific tasks or scenarios. However, existing foundation agents cannot generalize across scenarios, mainly because scenarios differ in their observation and action spaces and are separated by semantic gaps. In this work, we propose the General Computer Control (GCC) setting: building foundation agents that can master any computer task by taking the computer's screen (and possibly audio) as input and producing keyboard and mouse operations as output, similar to human-computer interaction. To target GCC, we propose Cradle, an agent with strong reasoning abilities, including self-reflection, task inference, and skill curation, to ensure its generalizability and self-improvement across various computer tasks. To demonstrate Cradle's capabilities, we deploy it in the well-known AAA game Red Dead Redemption II as a preliminary attempt towards GCC. Our agent can follow the main storyline and finish real missions in this complex AAA game with minimal reliance on prior knowledge.
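The input/output contract of the GCC setting (screen and optional audio in, keyboard and mouse operations out) can be sketched as a small observe-act loop. This is an illustrative sketch only and not Cradle's implementation: the names `Observation`, `Action`, `GCCAgent`, `EchoAgent`, and `run_episode`, as well as the string encoding of commands, are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Protocol


@dataclass
class Observation:
    """What the agent perceives: a screenshot and, optionally, audio."""
    screen: bytes                  # raw screenshot bytes (format is an assumption)
    audio: Optional[bytes] = None  # audio is optional in the GCC setting


@dataclass
class Action:
    """A single keyboard or mouse operation."""
    device: str   # "keyboard" or "mouse"
    command: str  # hypothetical encoding, e.g. "press:w" or "click:left"


class GCCAgent(Protocol):
    """Any object that maps an observation to a list of actions."""
    def act(self, obs: Observation) -> List[Action]: ...


class EchoAgent:
    """Placeholder policy: always presses one key, ignoring the screen."""
    def act(self, obs: Observation) -> List[Action]:
        return [Action(device="keyboard", command="press:w")]


def run_episode(agent: GCCAgent, frames: List[Observation]) -> List[Action]:
    """Feed each observation to the agent and collect the emitted actions."""
    trace: List[Action] = []
    for obs in frames:
        trace.extend(agent.act(obs))
    return trace


# Two stand-in frames; a real setup would capture the screen instead.
frames = [
    Observation(screen=b"frame0"),
    Observation(screen=b"frame1", audio=b"pcm"),
]
trace = run_episode(EchoAgent(), frames)
```

Because the agent sees only pixels (and possibly audio) and emits only keyboard and mouse operations, the same interface would apply unchanged to any application. A real agent would replace `EchoAgent` with a policy that reasons over the observation.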