Human3R: Everyone Everywhere All at Once
Abstract
We present Human3R, a unified, feed-forward framework for online 4D human-scene reconstruction in the world coordinate frame from casually captured monocular videos. Unlike previous approaches that rely on multi-stage pipelines, iterative contact-aware refinement between humans and scenes, and heavy dependencies (e.g., human detection and cropping, tracking, segmentation, camera pose or metric depth estimation, SLAM for 3D scenes, local human mesh recovery, etc.), Human3R jointly recovers global multi-person SMPL-X bodies (“everyone”), dense 3D scene geometry (“everywhere”), and camera trajectories in a single forward pass (“all at once”). Our method builds upon the 4D reconstruction foundation model CUT3R and leverages parameter-efficient visual prompt tuning to preserve its original rich spatiotemporal priors while enabling direct readout of SMPL-X parameters. To further improve the accuracy of global human pose and shape estimation, we introduce a bottom-up (one-shot) multi-person SMPL-X regressor trained on human-specific datasets. By removing heavy dependencies and iterative refinement, and by training only on the relatively small-scale synthetic dataset BEDLAM, Human3R achieves state-of-the-art performance with remarkable efficiency: it requires just one day of training on a single consumer GPU (NVIDIA RTX 4090) and operates in real time (15 FPS) with a low memory footprint (8 GB). Extensive experiments demonstrate that Human3R delivers state-of-the-art or competitive performance across all relevant tasks, including global human motion estimation, local human mesh recovery, video depth estimation, and camera pose estimation, with a single unified model. In summary, Human3R offers one unified model, one-stage inference, one-shot multi-person estimation, and one day of training on one GPU, enabling real-time, online processing of streaming inputs. We hope that Human3R will serve as a simple yet effective baseline that other researchers can easily extend to new applications, such as 6D object pose estimation (“everything”), and thereby facilitate future research in this direction. Code and models will be made publicly available.
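To make the single-pass, streaming formulation concrete, the following is a minimal Python sketch of what an online per-frame interface could look like. All names here (Human3RModel, FrameOutput, run_stream, the field shapes, and the parameter dimensions) are hypothetical illustrations rather than the released code; the abstract only specifies that each incoming frame yields global SMPL-X bodies, dense scene geometry, and a camera pose in one forward pass, with a persistent recurrent state as in CUT3R-style online reconstruction.

```python
# Hedged sketch of a single-pass, online streaming interface in the spirit of Human3R.
# Every class, function, and field below is illustrative, not the authors' released API.
from dataclasses import dataclass
from typing import List, Tuple
import numpy as np


@dataclass
class FrameOutput:
    smplx_params: List[np.ndarray]   # one parameter vector per person (global SMPL-X), dim is a placeholder
    scene_points: np.ndarray         # (H, W, 3) dense point map in world coordinates
    camera_pose: np.ndarray          # (4, 4) camera-to-world transform for this frame


class Human3RModel:
    """Placeholder for a feed-forward model with a persistent recurrent state."""

    def init_state(self):
        return None  # stands in for the model's internal spatiotemporal memory

    def forward(self, frame: np.ndarray, state) -> Tuple[FrameOutput, object]:
        h, w, _ = frame.shape
        # A real model would jointly regress all outputs in one pass; dummy values here.
        out = FrameOutput(
            smplx_params=[np.zeros(179)],       # placeholder pose + shape + expression coefficients
            scene_points=np.zeros((h, w, 3)),
            camera_pose=np.eye(4),
        )
        return out, state


def run_stream(model: Human3RModel, frames) -> List[FrameOutput]:
    # Online processing: one forward pass per incoming frame, with no detection,
    # cropping, tracking, or post-hoc refinement stages.
    state = model.init_state()
    outputs = []
    for frame in frames:
        out, state = model.forward(frame, state)
        outputs.append(out)
    return outputs


if __name__ == "__main__":
    dummy_frames = [np.zeros((480, 640, 3), dtype=np.uint8) for _ in range(3)]
    results = run_stream(Human3RModel(), dummy_frames)
    print(len(results), results[0].camera_pose.shape)
```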