Poster

X-Gen: Ego-centric Video Prediction by Watching Exo-centric Videos

Jilan Xu · Yifei Huang · Baoqi Pei · Junlin Hou · Qingqiu Li · Guo Chen · Yuejie Zhang · Rui Feng · Weidi Xie

Hall 3 + Hall 2B #568
Fri 25 Apr midnight PDT — 2:30 a.m. PDT

Abstract:

Generating videos from the first-person perspective has broad applications in augmented reality and embodied intelligence. In this work, we explore the cross-view video prediction task: given an exo-centric video, the first frame of the corresponding ego-centric video, and textual instructions, the goal is to generate future frames of the ego-centric video. Inspired by the notion that hand-object interactions (HOI) in ego-centric videos represent the primary intentions and actions of the current actor, we present X-Gen, which explicitly models hand-object dynamics for cross-view video prediction. X-Gen consists of two stages. First, we design a cross-view HOI mask prediction model that anticipates the HOI masks in future ego-frames by modeling the spatio-temporal ego-exo correspondence. Next, we employ a video diffusion model to predict future ego-frames from the first ego-frame and textual instructions, incorporating the HOI masks as structural guidance to enhance prediction quality. To facilitate training, we develop a fully automated pipeline that generates pseudo HOI masks for both ego- and exo-videos by exploiting vision foundation models. Extensive experiments demonstrate that X-Gen achieves better prediction performance than previous video prediction models on the public Ego-Exo4D and H2O benchmarks, with the HOI masks significantly improving the generation of hands and interactive objects in the ego-centric videos.
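To make the two-stage structure of the abstract concrete, below is a minimal, hypothetical PyTorch sketch of the pipeline: a stage-1 module that predicts future ego-view HOI masks from the exo-centric video and the first ego-centric frame, followed by a stage-2 module that predicts future ego frames conditioned on the first ego frame, a text embedding, and the predicted masks. The module names, additive fusion, and convolutional placeholders are illustrative assumptions only; the paper's actual stage 2 is a video diffusion model, which is not reproduced here.

```python
# Hypothetical sketch of the two-stage X-Gen pipeline (not the authors' code).
import torch
import torch.nn as nn


class CrossViewHOIMaskPredictor(nn.Module):
    """Stage 1 (placeholder): anticipate future ego-view HOI masks from the
    exo-centric video and the first ego-centric frame."""

    def __init__(self, dim=64):
        super().__init__()
        self.exo_encoder = nn.Conv3d(3, dim, kernel_size=3, padding=1)
        self.ego_encoder = nn.Conv2d(3, dim, kernel_size=3, padding=1)
        self.decoder = nn.Conv3d(dim, 1, kernel_size=3, padding=1)

    def forward(self, exo_video, ego_first_frame):
        # exo_video: (B, 3, T, H, W); ego_first_frame: (B, 3, H, W)
        exo_feat = self.exo_encoder(exo_video)                     # (B, D, T, H, W)
        ego_feat = self.ego_encoder(ego_first_frame).unsqueeze(2)  # (B, D, 1, H, W)
        fused = exo_feat + ego_feat                                # crude ego-exo fusion
        return torch.sigmoid(self.decoder(fused))                  # (B, 1, T, H, W) mask logits -> probabilities


class MaskGuidedEgoVideoPredictor(nn.Module):
    """Stage 2 (placeholder): predict future ego frames from the first ego frame,
    a text embedding, and the stage-1 HOI masks used as structural guidance."""

    def __init__(self, dim=64, text_dim=64):
        super().__init__()
        self.frame_encoder = nn.Conv2d(3, dim, kernel_size=3, padding=1)
        self.mask_encoder = nn.Conv3d(1, dim, kernel_size=3, padding=1)
        self.text_proj = nn.Linear(text_dim, dim)
        self.decoder = nn.Conv3d(dim, 3, kernel_size=3, padding=1)

    def forward(self, ego_first_frame, text_emb, hoi_masks):
        frame_feat = self.frame_encoder(ego_first_frame).unsqueeze(2)  # (B, D, 1, H, W)
        mask_feat = self.mask_encoder(hoi_masks)                       # (B, D, T, H, W)
        text_feat = self.text_proj(text_emb)[:, :, None, None, None]   # (B, D, 1, 1, 1)
        fused = frame_feat + mask_feat + text_feat                     # broadcast over time and space
        return self.decoder(fused)                                     # (B, 3, T, H, W) future ego frames


if __name__ == "__main__":
    B, T, H, W = 1, 8, 32, 32
    exo_video = torch.randn(B, 3, T, H, W)
    ego_first_frame = torch.randn(B, 3, H, W)
    text_emb = torch.randn(B, 64)

    stage1 = CrossViewHOIMaskPredictor()
    stage2 = MaskGuidedEgoVideoPredictor()

    hoi_masks = stage1(exo_video, ego_first_frame)             # predicted future ego HOI masks
    ego_future = stage2(ego_first_frame, text_emb, hoi_masks)  # mask-guided ego-frame prediction
    print(hoi_masks.shape, ego_future.shape)
```

The sketch only illustrates the data flow (exo video + first ego frame -> HOI masks -> mask-guided ego-frame prediction); training with the pseudo HOI masks produced by the automated labeling pipeline would supervise stage 1, while stage 2 in the paper is trained as a conditional video diffusion model.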
