[Tiny Paper] Training-Free Construction of Executable 3D Worlds from Narrative Text
Abstract
A prominent trend in recent world generation and world model research emphasizes foundation-scale approaches, particularly diffusion-based architectures trained on large video and multimodal datasets, often requiring significant computational resources. While effective, this paradigm implicitly assumes access to infrastructure that is unavailable to many researchers and practitioners. In this work, we explore an alternative perspective on world model construction under strict compute constraints. We present a modular, training-free framework that leverages existing multimodal large language models (MLLMs) and open-source text-to-3D asset generators through lightweight API calls to construct story-driven, navigable 3D worlds. Rather than learning world dynamics end-to-end, our system extracts structured semantic representations from narrative text and deterministically compiles them into spatial layouts, connectivity graphs, and executable environments. We demonstrate that coherent, traversable worlds can be generated on commodity hardware, suggesting that world model research can advance not only through scaling compute, but also through structural abstraction, compositional design, and systems-level reasoning.
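To make the compile step concrete, the following is a minimal sketch of how a structured representation might be turned deterministically into a spatial layout and a connectivity graph, with a traversability check on the result. It is illustrative only and is not the system described above: the `story_spec` schema, the function names `compile_world` and `is_traversable`, and the single-row grid placement are all assumptions introduced for this example, and the MLLM extraction and text-to-3D asset calls are omitted.

```python
from collections import deque

# Hypothetical structured representation, of the kind an MLLM might be
# prompted to return for a short narrative. The schema is an assumption.
story_spec = {
    "rooms": ["cottage", "forest_path", "cave"],
    "links": [("cottage", "forest_path"), ("forest_path", "cave")],
}


def compile_world(spec):
    """Compile a spec into (layout, graph) with no randomness or learning.

    layout maps each room name to a grid cell (x, y); graph maps each room
    name to the set of rooms it connects to (undirected).
    """
    # Sort names so the output does not depend on the order of the input list.
    rooms = sorted(spec["rooms"])
    # Assumed placement rule: one cell per room along a single row.
    layout = {name: (i, 0) for i, name in enumerate(rooms)}
    graph = {name: set() for name in rooms}
    for a, b in spec["links"]:
        if a not in graph or b not in graph:
            raise ValueError(f"link references unknown room: {a!r} -> {b!r}")
        graph[a].add(b)
        graph[b].add(a)
    return layout, graph


def is_traversable(graph, start):
    """Return True if every room is reachable from start (breadth-first search)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in sorted(graph[node]):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return len(seen) == len(graph)


layout, graph = compile_world(story_spec)
print(layout)
print(is_traversable(graph, "cottage"))
```

Because the compile step in this sketch uses only sorting and dictionary construction, the same specification always yields the same layout and graph, which is the property the word "deterministically" refers to here. A real placement rule would need to account for room sizes and the spatial relations stated in the narrative.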