Hierarchical Value-Decomposed Offline Reinforcement Learning for Whole-Body Control
Abstract
Scaling imitation learning to high-DoF whole-body robots is fundamentally constrained by the scarcity of expert demonstrations. In contrast, large amounts of suboptimal data are readily available and offer a practical way to alleviate supervision bottlenecks in real-world whole-body control. However, leveraging such data introduces two central challenges: how to extract informative signals from imperfect trajectories, and how to cope with the increased learning complexity induced by high-dimensional control. To address these challenges, we propose HVD (Hierarchical Value-Decomposed Offline Reinforcement Learning). The offline RL formulation provides principled data selection over suboptimal datasets, enabling the policy to prioritize high-value behaviors while down-weighting harmful ones. Complementarily, hierarchical value decomposition organizes learning along the robot’s kinematic structure, improving credit assignment and reducing learning complexity in high-DoF systems. Built on a Transformer-based architecture, HVD supports multi-modal and multi-task learning, allowing flexible integration of diverse sensory inputs. To enable realistic evaluation and training, we further introduce WB-50, a 50-hour dataset of teleoperated and policy-rollout trajectories annotated with rewards and preserving natural imperfections, including partial successes, corrections, and failures. Experiments show that HVD significantly outperforms existing baselines in success rate across complex whole-body tasks. Our results suggest that effective policy learning for high-DoF systems can emerge not from perfect demonstrations, but from structured learning over realistic, imperfect data. Our code is available at https://github.com/LAMDA-RL/HVD.