Poster
VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks
Lawrence Jang · Yinheng Li · Dan Zhao · Charles Ding · Justin Lin · Paul Pu Liang · Rogerio Bonatti · Kazuhito Koishida
Hall 3 + Hall 2B #314
Videos are often used to learn or extract the necessary information to complete tasks in ways different from what text or static imagery can provide. However, many existing agent benchmarks neglect long-context video understanding, instead focusing on text or static image inputs. To bridge this gap, we introduce VideoWebArena (VideoWA), a benchmark for evaluating the capabilities of long-context multimodal agents for video understanding. VideoWA consists of 2,021 web agent tasks based on manually crafted video tutorials, which total almost four hours of content. For our benchmark, we define a taxonomy of long-context video-based agent tasks with two main areas of focus: skill retention and factual retention. Skill retention tasks evaluate whether an agent can use a given human demonstration to complete a task efficiently, while factual retention tasks evaluate whether an agent can retrieve instruction-relevant information from a video to complete a task. We find that the best model achieves a 13.3% success rate on factual retention tasks and 45.8% on factual retention QA pairs, far below human success rates of 73.9% and 79.3%, respectively. On skill retention tasks, long-context models perform worse with tutorials than without, exhibiting a 5% performance decrease on WebArena tasks and a 10.3% decrease on VisualWebArena tasks. Our work highlights performance gaps in the agentic abilities of long-context multimodal models and provides a testbed for the future development of long-context video agents.