EquAct: An SE(3)-Equivariant Multi-Task Transformer for 3D Robotic Manipulation
Xupeng Zhu · Yu Qi · Yizhe Zhu · Robin Walters · Robert Platt
Abstract
Multi-task manipulation policies often build on a transformer's ability to jointly process language instructions and 3D observations in a shared embedding space. However, real-world tasks frequently require robots to generalize to novel 3D object poses. Policies based on a shared embedding break geometric consistency and struggle with 3D generalization. To address this issue, we propose EquAct, which is theoretically guaranteed to generalize to novel 3D scene transformations by leveraging SE(3) equivariance shared across language, observations, and actions. EquAct makes two key contributions: (1) an efficient SE(3)-equivariant point-cloud-based U-net with spherical Fourier features for policy reasoning, and (2) SE(3)-invariant Feature-wise Linear Modulation (iFiLM) layers for language conditioning. Finally, EquAct demonstrates strong spatial generalization and achieves state-of-the-art performance across $18$ RLBench tasks under both SE(3) and SE(2) scene perturbations and different amounts of training data, as well as on $4$ physical tasks.
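To make the language-conditioning idea concrete, below is a minimal sketch of an SE(3)-invariant FiLM-style layer, not the authors' implementation: the class name `InvariantFiLM`, the scalar/vector channel split, and all tensor shapes are illustrative assumptions. The language embedding predicts per-channel scales and shifts; shifts are applied only to rotation-invariant scalar channels, while rotation-equivariant vector channels are only scaled, so the conditioned features remain equivariant.

```python
import torch
import torch.nn as nn


class InvariantFiLM(nn.Module):
    """Sketch of an SE(3)-invariant FiLM conditioning layer (hypothetical).

    Per-channel scales and shifts are predicted from a pooled language
    embedding. Scaling an equivariant (vector) channel by a scalar commutes
    with rotation; adding a bias only to invariant (scalar) channels leaves
    equivariance intact.
    """

    def __init__(self, lang_dim: int, scalar_channels: int, vector_channels: int):
        super().__init__()
        self.to_scalar_gamma = nn.Linear(lang_dim, scalar_channels)
        self.to_scalar_beta = nn.Linear(lang_dim, scalar_channels)
        self.to_vector_gamma = nn.Linear(lang_dim, vector_channels)

    def forward(self, scalars: torch.Tensor, vectors: torch.Tensor,
                lang_emb: torch.Tensor):
        # scalars:  (B, N, Cs)    rotation-invariant point features
        # vectors:  (B, N, Cv, 3) rotation-equivariant point features
        # lang_emb: (B, L)        pooled language embedding
        g_s = self.to_scalar_gamma(lang_emb).unsqueeze(1)   # (B, 1, Cs)
        b_s = self.to_scalar_beta(lang_emb).unsqueeze(1)    # (B, 1, Cs)
        g_v = self.to_vector_gamma(lang_emb).unsqueeze(1)   # (B, 1, Cv)

        scalars = g_s * scalars + b_s                       # full FiLM on invariants
        vectors = g_v.unsqueeze(-1) * vectors               # scale-only on equivariants
        return scalars, vectors


# Usage sketch with random inputs:
# layer = InvariantFiLM(lang_dim=512, scalar_channels=64, vector_channels=32)
# s, v = layer(torch.randn(2, 1024, 64), torch.randn(2, 1024, 32, 3),
#              torch.randn(2, 512))
```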