Poster
VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks
Lawrence Jang · Yinheng Li · Dan Zhao · Charles Ding · Justin Lin · Paul Pu Liang · Rogerio Bonatti · Kazuhito Koishida
Hall 3 + Hall 2B #314
Videos are often used to learn or extract the necessary information to complete tasks in ways different from what text or static imagery can provide. However, many existing agent benchmarks neglect long-context video understanding, instead focusing on text or static image inputs. To bridge this gap, we introduce VideoWebArena (VideoWA), a benchmark for evaluating the capabilities of long-context multimodal agents for video understanding. VideoWA consists of 2,021 web agent tasks based on manually crafted video tutorials, which total almost four hours of content. For our benchmark, we define a taxonomy of long-context video-based agent tasks with two main areas of focus: skill retention and factual retention. Skill retention tasks evaluate whether an agent can use a given human demonstration to complete a task efficiently, while factual retention tasks evaluate whether an agent can retrieve instruction-relevant information from a video to complete a task. We find that the best model achieves a 13.3% success rate on factual retention tasks and 45.8% on factual retention QA pairs, far below human success rates of 73.9% and 79.3%, respectively. On skill retention tasks, long-context models perform worse with tutorials than without, exhibiting a 5% performance decrease on WebArena tasks and a 10.3% decrease on VisualWebArena tasks. Our work highlights performance gaps in the agentic abilities of long-context multimodal models and provides a testbed for the future development of long-context video agents.