Probing Visual Planning in Image Editing Models
Zhimu Zhou ⋅ Yanpeng Zhao ⋅ Qiuyu Liao ⋅ Bo ZHAO ⋅ Xiaojian Ma
Abstract
Visual planning is a crucial facet of human intelligence, especially in tasks that require complex spatial reasoning and navigation. Yet in machine learning, this inherently visual problem is often tackled through a verbal-centric lens. While recent research demonstrates the promise of fully visual approaches, these approaches suffer from significant computational inefficiency due to the step-by-step `planning-by-generation' paradigm. In this work, we present \model, an editing-as-reasoning paradigm that reformulates visual planning as a single-step image transformation. To isolate intrinsic reasoning from visual recognition, we use maze navigation as the probing task and introduce \bench, a procedurally generated dataset that spans four distinct geometric complexities. The abstract nature of these mazes facilitates rigorous automatic evaluation of autoregressive and diffusion-based models in terms of both pixel-wise fidelity and topological correctness. We assess leading proprietary and open-source editing models. The results show that all of them struggle in the zero-shot setting, whereas finetuning on basic $3\times3$ mazes enables remarkable generalization to $16\times16$ in-domain mazes and to out-of-domain geometries. However, even when run on high-end hardware, our best model fails to match the zero-shot efficiency of human solvers, highlighting a persistent gap in neural visual reasoning.
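To make the two evaluation axes named in the abstract concrete, the following is a minimal sketch of what a pixel-wise fidelity score and a topological correctness check could look like on a square grid maze. It is an illustration under stated assumptions, not the metric used in the paper: the grid representation, the wall encoding, and the function names `pixel_accuracy` and `is_valid_path` are all hypothetical.

```python
from typing import FrozenSet, List, Set, Tuple

Cell = Tuple[int, int]          # (row, col) of a maze cell; assumed encoding
Wall = FrozenSet[Cell]          # a wall is the unordered pair of cells it separates


def pixel_accuracy(pred: List[List[int]], target: List[List[int]]) -> float:
    """Fraction of pixels where the predicted image equals the target image."""
    total = 0
    matches = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            total += 1
            matches += int(p == t)
    return matches / total if total else 0.0


def is_valid_path(path: List[Cell], start: Cell, goal: Cell,
                  size: int, walls: Set[Wall]) -> bool:
    """Check that a path goes from start to goal without leaving the grid,
    jumping between non-adjacent cells, or crossing a wall."""
    if not path or path[0] != start or path[-1] != goal:
        return False
    if any(not (0 <= r < size and 0 <= c < size) for r, c in path):
        return False
    for a, b in zip(path, path[1:]):
        # Consecutive cells must be 4-neighbours (Manhattan distance 1).
        if abs(a[0] - b[0]) + abs(a[1] - b[1]) != 1:
            return False
        if frozenset((a, b)) in walls:
            return False
    return True


# Hypothetical 3x3 maze with a single wall between (0, 0) and (0, 1).
walls = {frozenset(((0, 0), (0, 1)))}
detour = [(0, 0), (1, 0), (1, 1), (0, 1)]
shortcut = [(0, 0), (0, 1)]
```

In this sketch, `detour` is accepted because it walks around the wall, while `shortcut` is rejected because it crosses it. The two scores can disagree: an output image may match the target on almost every pixel and still draw a path that cuts through a wall, which is one reason to report both.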