Towards Practical World Model-based Reinforcement Learning for Vision-Language-Action Models
Abstract
Vision-Language-Action (VLA) models show strong generalization for robotic control, but finetuning them with reinforcement learning (RL) is constrained by the high cost and safety risks of real-world interaction. Training VLA models inside interactive world models avoids these issues but introduces several challenges, including pixel-level world modeling, multi-view consistency, and compounding errors under sparse rewards. Building on recent advances in multimodal models and model-based RL, we propose VLA-MBPO, a practical world model-based RL framework that addresses these challenges in VLA finetuning. Our approach is guided by three key design choices: (i) adapting unified multimodal models (UMMs) to VLA settings, leveraging their rich multimodal priors to enable world modeling with limited data; (ii) introducing an interleaved view decoding mechanism to enforce consistency across views; and (iii) employing chunk-level branched rollouts to limit rollout horizons and mitigate error compounding during policy optimization. Our theoretical analysis shows that VLA-MBPO reduces the value gap, and experiments on both simulated and real-world tasks demonstrate that our method improves policy performance and sample efficiency for VLA finetuning.
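To make design choice (iii) concrete, the following is a minimal sketch of a branched rollout counted in action chunks, in the general style of MBPO. It is an illustration under stated assumptions, not the paper's implementation: the names `branched_rollouts`, `toy_policy`, and `toy_world_model` are hypothetical, states are plain floats rather than multi-view images, and the world model is a hand-written stand-in for a learned one.

```python
import random
from typing import Callable, List, Tuple

State = float                      # assumption: scalar stand-in for an observation
Chunk = Tuple[float, ...]          # assumption: an action chunk is a short action sequence
Transition = Tuple[State, Chunk, float, State]


def branched_rollouts(
    real_states: List[State],
    policy: Callable[[State], Chunk],
    world_model: Callable[[State, Chunk], Tuple[State, float]],
    num_branches: int,
    max_chunks: int,
    rng: random.Random,
) -> List[Transition]:
    """Collect short model rollouts that each branch from a real state."""
    synthetic: List[Transition] = []
    for _ in range(num_branches):
        # Branch point: start from a state observed in real interaction,
        # so model error only accumulates over a few chunks.
        state = rng.choice(real_states)
        for _ in range(max_chunks):
            chunk = policy(state)                           # one whole action chunk
            next_state, reward = world_model(state, chunk)  # one model step per chunk
            synthetic.append((state, chunk, reward, next_state))
            state = next_state
    return synthetic


# Hypothetical toy stand-ins, used only to exercise the function.
def toy_policy(state: State) -> Chunk:
    return (0.1, 0.1, 0.1, 0.1)


def toy_world_model(state: State, chunk: Chunk) -> Tuple[State, float]:
    next_state = state + sum(chunk)
    reward = 1.0 if next_state >= 1.0 else 0.0  # sparse success reward
    return next_state, reward


rollouts = branched_rollouts(
    real_states=[0.0, 0.5],
    policy=toy_policy,
    world_model=toy_world_model,
    num_branches=3,
    max_chunks=2,
    rng=random.Random(0),
)
```

The rollout horizon is bounded by `max_chunks` rather than by the number of low-level actions, so each synthetic trajectory stays close to real data while still covering several control steps per model call.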