Action Shapley: A Training Data Selection Metric for Training World Models for Reinforcement Learning
Abstract
World models are central to model-based reinforcement learning, enabling agents to predict environment dynamics and reason about future outcomes. In real-world settings, however, training high-fidelity world models is often constrained by limited, noisy, and heterogeneous interaction data, which makes data selection a critical yet under-studied problem. To address this gap, we introduce Action Shapley, a principled data valuation metric for training world models for reinforcement learning. Action Shapley quantifies the marginal contribution of individual action–state trajectories to downstream control performance, enabling systematic selection of high-value training data. Because exact Shapley value computation is exponential in the number of trajectories, we propose a randomized algorithm that exploits failure modes of learned world models to identify a cut-off cardinality, significantly reducing computation while preserving ranking fidelity. We evaluate Action Shapley with model-based RL agents across four real-world, partially observable control domains, including cloud resource management, database tuning, and Kubernetes workload control. Across all domains, Action Shapley–based data selection improves data efficiency by up to 67% and often outperforms training on the full dataset in cumulative reward.
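To make the notion of marginal contribution concrete, the following is a minimal sketch of the classical exact Shapley value computation over a small set of trajectories. It is not the randomized algorithm proposed in this paper. The trajectory names, the reward table, and `toy_utility` are hypothetical stand-ins for "train a world model on this subset and measure downstream control reward." The nested loop over all subsets is what makes the exact computation exponential in the number of trajectories.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, utility):
    """Exact Shapley values; enumerates all 2^(n-1) subsets per player."""
    n = len(players)
    values = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Classical Shapley weight for coalitions of size k
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of p when added to coalition s
                total += weight * (utility(s | {p}) - utility(s))
        values[p] = total
    return values


# Hypothetical per-trajectory rewards (illustrative numbers only)
TOY_REWARD = {"traj_a": 3.0, "traj_b": 1.0, "traj_c": 0.0}


def toy_utility(subset):
    # Assumed utility: additive reward plus a bonus when traj_a and
    # traj_b are both in the training set (a simple interaction effect)
    base = sum(TOY_REWARD[t] for t in subset)
    bonus = 2.0 if {"traj_a", "traj_b"} <= subset else 0.0
    return base + bonus


values = shapley_values(list(TOY_REWARD), toy_utility)
# Rank trajectories from most to least valuable
ranking = sorted(values, key=values.get, reverse=True)
print(values, ranking)
```

In this toy setting the interaction bonus is split evenly between the two trajectories that produce it, and the values sum to the utility of the full set. Selecting training data then amounts to keeping the top-ranked trajectories.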