Poster Sat, Apr 25, 2026 • 6:30 AM – 9:00 AM PDT Pavilion 3 P3-#1924

R-WoM: Retrieval-augmented World Model For Computer-use Agents

Kai Mei ⋅ Jiang Guo ⋅ Shuaichen Chang ⋅ Mingwen Dong ⋅ Dongkyu Lee ⋅ Xing Niu ⋅ Jiarong Jiang


Abstract

Large Language Models (LLMs) can serve as world models to enhance agent decision-making in digital environments by simulating future states and predicting action outcomes, potentially eliminating costly trial-and-error exploration. However, this capability is fundamentally limited by LLMs' tendency to hallucinate and their reliance on static training knowledge, which can lead to compounding errors that inhibit long-horizon simulations. To systematically investigate whether LLMs are appropriate for world modeling, we probe two core capabilities of world models, future state prediction and reward estimation, through three tasks: next-state identification, full-procedure planning alignment, and milestone transition recognition. Our analysis shows that while LLMs effectively capture immediate next states and identify meaningful state transitions, their performance rapidly degrades in full-procedure planning. This highlights LLMs’ limitations in reliably modeling environment dynamics over long horizons. To address these limitations, we propose the Retrieval-augmented World Model (R-WoM), which grounds LLM simulations by incorporating factual, up-to-date knowledge retrieved from external tutorials. Experiments show that R-WoM achieves relative improvements of up to 23.4% and 16.3% on subsets of OSWorld and WebArena compared to baselines, with a particular advantage in longer-horizon simulations.
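
The grounding idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the retriever, the prompt format, or the rollout procedure, so everything below is an assumption made for illustration. The bag-of-words cosine retriever, the helper names (`retrieve`, `build_prompt`, `simulate`), and the `llm` callable are all hypothetical. The sketch only shows the general pattern of retrieving tutorial text at each simulated step and placing it in the prompt before asking a model to predict the next state.

```python
import math
from collections import Counter
from typing import Callable, List


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenization (a stand-in for a real retriever's preprocessing)."""
    return text.lower().split()


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, tutorials: List[str], k: int = 2) -> List[str]:
    """Return the k tutorial snippets most similar to the query (hypothetical retriever)."""
    q = Counter(tokenize(query))
    ranked = sorted(tutorials, key=lambda doc: cosine(q, Counter(tokenize(doc))), reverse=True)
    return ranked[:k]


def build_prompt(state: str, action: str, snippets: List[str]) -> str:
    """Assemble a next-state prediction prompt grounded in retrieved text (assumed format)."""
    context = "\n".join(f"- {s}" for s in snippets)
    return (
        f"Relevant tutorial excerpts:\n{context}\n\n"
        f"Current state: {state}\n"
        f"Action: {action}\n"
        "Predict the next state."
    )


def simulate(
    state: str,
    actions: List[str],
    tutorials: List[str],
    llm: Callable[[str], str],
    k: int = 2,
) -> List[str]:
    """Roll out a sequence of actions, re-retrieving tutorial text at every step."""
    states = [state]
    for action in actions:
        snippets = retrieve(f"{states[-1]} {action}", tutorials, k)
        states.append(llm(build_prompt(states[-1], action, snippets)))
    return states
```

Here `llm` is any function that maps a prompt string to a predicted state description, so a real model client or a test stub can be passed in. Re-retrieving at every step is one plausible design choice for keeping longer rollouts tied to external text rather than to the model's earlier predictions alone; the paper itself should be consulted for the actual method.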
