Poster in Workshop: World Models: Understanding, Modelling and Scaling

BEYOND SINGLE-STEP: MULTI-FRAME ACTION-CONDITIONED VIDEO GENERATION FOR REINFORCEMENT LEARNING ENVIRONMENTS

Zongyue Li · Sikuan Yan · Yunpu Ma · Yusong Li · Xing Lyu · Matthias Schubert

Keywords: [ Video Generation ] [ Diffusion World Model ] [ Dynamics Learning ] [ Action Conditioned ] [ Reinforcement Learning ]


Abstract:

World models have achieved great success in learning dynamics from both low-dimensional and high-dimensional states. Yet no existing work addresses the multi-step generation task with high-dimensional data. In this paper, we propose the first action-conditioned multi-frame video generation model, advancing world model development by generating future states from actions. In contrast to recent single-step or autoregressive approaches, our model directly generates multiple future frames conditioned on past observations and action sequences. Our framework extends Open-Sora to action-conditioned video generation by introducing an action encoder, which enables the spatiotemporal variational autoencoder and diffusion transformer to effectively incorporate action information, ensuring precise and coherent video generation. We evaluate performance on Atari environments (Breakout, Pong, DemonAttack) using MSE, PSNR, and LPIPS. Results show that conditioning solely on future actions, together with embedding-based action encoding, improves generation accuracy and perceptual quality while capturing complex temporal dependencies such as inertia. Our work paves the way for action-conditioned multi-step generative world models in dynamic environments.
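The abstract describes an embedding-based action encoder that turns a discrete action sequence into conditioning signals for the diffusion transformer. As a rough illustration only, the minimal PyTorch sketch below shows one plausible shape for such an encoder; the class name, dimensions, and projection layer are assumptions for illustration, not the authors' implementation.

    import torch
    import torch.nn as nn

    class ActionEncoder(nn.Module):
        """Embeds a sequence of discrete actions into per-frame
        conditioning vectors for a diffusion transformer (hypothetical)."""
        def __init__(self, num_actions: int, embed_dim: int):
            super().__init__()
            self.embed = nn.Embedding(num_actions, embed_dim)
            self.proj = nn.Linear(embed_dim, embed_dim)

        def forward(self, actions: torch.Tensor) -> torch.Tensor:
            # actions: (batch, num_future_frames) integer action ids
            # returns: (batch, num_future_frames, embed_dim)
            return self.proj(self.embed(actions))

    encoder = ActionEncoder(num_actions=4, embed_dim=256)  # Breakout has 4 discrete actions
    actions = torch.randint(0, 4, (2, 8))                  # batch of 2 clips, 8 future frames each
    cond = encoder(actions)                                # (2, 8, 256), one vector per future frame

Per-frame conditioning vectors like these can be injected into the transformer alongside the frame tokens, which matches the abstract's claim that conditioning on future actions drives the generation of each future frame.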
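The evaluation reports MSE, PSNR, and LPIPS. A minimal sketch of these metrics follows; the use of the lpips package and a [0, 1] pixel range are assumptions about tooling, not details given in the abstract.

    import torch
    import lpips  # pip install lpips; learned perceptual similarity (assumed tooling)

    def mse(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        return ((pred - target) ** 2).mean()

    def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
        # Peak signal-to-noise ratio in dB, for pixels in [0, max_val].
        return 10.0 * torch.log10(max_val ** 2 / mse(pred, target))

    lpips_fn = lpips.LPIPS(net="alex")   # expects inputs scaled to [-1, 1]

    pred = torch.rand(1, 3, 64, 64)      # hypothetical generated frame
    target = torch.rand(1, 3, 64, 64)    # hypothetical ground-truth frame
    print(mse(pred, target).item(), psnr(pred, target).item())
    print(lpips_fn(pred * 2 - 1, target * 2 - 1).item())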
