Poster Fri, Apr 24, 2026 • 11:15 AM – 1:45 PM PDT

Shop-R1: Rewarding LLMs to Simulate Human Behavior in Online Shopping via Reinforcement Learning

Yimeng Zhang ⋅ Tian Wang ⋅ Jiri Gesi ⋅ Ziyi Wang ⋅ Yuxuan Lu ⋅ Jiacheng Lin ⋅ Simon Zhan ⋅ Vianne Gao ⋅ Ruochen Jiao ⋅ Junze Liu ⋅ Kun Qian ⋅ Yuxin Tang ⋅ Ran Xue ⋅ Houyu Zhang ⋅ Qingjun Cui ⋅ Yufan Guo ⋅ Dakuo Wang


Abstract

Large Language Models (LLMs) have recently demonstrated strong potential in generating ‘believable human-like’ behavior in web environments. Prior work has explored augmenting training data with LLM-synthesized rationales and applying supervised fine-tuning (SFT) to enhance reasoning ability, which in turn can improve downstream action prediction. However, the performance of such approaches remains inherently bounded by the reasoning capabilities of the model used to generate the rationales. In this paper, we introduce Shop-R1, a novel reinforcement learning (RL) framework aimed at enhancing the reasoning ability of LLMs for simulating real human behavior in online shopping environments. Specifically, Shop-R1 decomposes the human behavior simulation task into two stages, rationale generation and action prediction, each guided by distinct reward signals. For rationale generation, we leverage internal model signals (e.g., logit distributions) to guide the reasoning process in a self-supervised manner. For action prediction, we propose a hierarchical reward structure with difficulty-aware scaling to prevent reward hacking and enable fine-grained reward assignment. This design evaluates both the high-level action type and the correctness of fine-grained sub-action details (attributes and values), rewarding outputs in proportion to their difficulty. Experimental results show that our method achieves a relative improvement of over 65% compared to the baseline. The project page is available at https://damon-demon.github.io/shop-r1.html.
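
The two reward signals described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the action-type names, the difficulty weights, the partial-credit values, and the choice of normalized entropy as the "internal model signal" are all assumptions made for illustration; the abstract says only that logit distributions guide the rationale and that the action reward is hierarchical and scaled by difficulty.

```python
import math

def certainty_reward(token_dists):
    """Self-supervised rationale signal (assumed form): mean of 1 - H(p)/log(V)
    over the rationale's tokens. 1.0 means fully confident, 0.0 means uniform."""
    scores = []
    for p in token_dists:
        entropy = -sum(q * math.log(q) for q in p if q > 0)
        scores.append(1.0 - entropy / math.log(len(p)))
    return sum(scores) / len(scores)

# Hypothetical action types and difficulty weights: harder action types
# scale up the reward for getting their details right.
DIFFICULTY = {"terminate": 1.0, "click": 2.0, "type_and_submit": 3.0}

def action_reward(pred, gold):
    """Hierarchical reward (assumed values): credit for the action type first,
    then for each sub-action attribute, then for each matching value."""
    if pred.get("type") != gold["type"]:
        return 0.0                       # wrong high-level type earns nothing
    reward = 1.0                         # correct action type
    scale = DIFFICULTY.get(gold["type"], 1.0)
    pred_details = pred.get("details", {})
    for attr, gold_value in gold.get("details", {}).items():
        if attr in pred_details:
            reward += 0.5 * scale        # attribute is present
            if pred_details[attr] == gold_value:
                reward += 1.0 * scale    # value is also correct
    return reward

gold = {"type": "click", "details": {"name": "add_to_cart"}}
exact = action_reward({"type": "click", "details": {"name": "add_to_cart"}}, gold)
partial = action_reward({"type": "click", "details": {"name": "buy_now"}}, gold)
wrong_type = action_reward({"type": "terminate"}, gold)
```

In this sketch a correct detailed action earns more than a correct simple one, so a policy cannot maximize reward by always predicting the easiest action type, which is one plausible reading of how difficulty-aware scaling discourages reward hacking.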
