Poster in Workshop: The First Workshop on Efficient Spatial Reasoning
Mon, Apr 27, 2026 • 7:45 AM – 8:45 AM PDT

Probing Visual Planning in Image Editing Models

Zhimu Zhou ⋅ Yanpeng Zhao ⋅ Qiuyu Liao ⋅ Bo ZHAO ⋅ Xiaojian Ma

Project Page [OpenReview]

Abstract

Visual planning is a crucial facet of human intelligence, especially in tasks that require complex spatial reasoning and navigation. Yet in machine learning, this inherently visual problem is often tackled through a verbal-centric lens. Recent research demonstrates the promise of fully visual approaches, but these approaches are computationally inefficient because of their step-by-step "planning-by-generation" paradigm. In this work, we present \model, an editing-as-reasoning paradigm that reformulates visual planning as a single-step image transformation. To isolate intrinsic reasoning from visual recognition, we use maze navigation as the probing task and introduce \bench, a procedurally generated dataset that spans four distinct geometric complexities. The abstract nature of these mazes facilitates rigorous automatic evaluation of autoregressive and diffusion-based models in terms of both pixel-wise fidelity and topological correctness. We assess leading proprietary and open-source editing models. The results show that although they all struggle in the zero-shot setting, finetuning on basic $3\times3$ mazes enables remarkable generalization to $16\times16$ in-domain mazes and out-of-domain geometries. However, our best model, even when running on high-end hardware, fails to match the zero-shot efficiency of human solvers, highlighting a persistent gap in neural visual reasoning.
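
To make the two evaluation criteria concrete, the following is a minimal sketch of how pixel-wise fidelity and topological correctness could be scored on a grid maze. It is not the authors' evaluation code: the maze encoding (a set of open passages between adjacent cells), the function names `path_is_valid` and `pixel_accuracy`, and the assumption that a path has already been extracted from the edited image as a list of cells are all hypothetical choices made for illustration.

```python
from typing import FrozenSet, List, Sequence, Set, Tuple

Cell = Tuple[int, int]  # (row, col); hypothetical encoding, not from the paper


def path_is_valid(
    path: List[Cell],
    start: Cell,
    goal: Cell,
    passages: Set[FrozenSet[Cell]],
) -> bool:
    """Topological check: the path runs from start to goal and every step
    uses an open passage (a pair of adjacent cells with no wall between)."""
    if not path or path[0] != start or path[-1] != goal:
        return False
    for a, b in zip(path, path[1:]):
        # A step is legal only if the unordered pair is a known passage.
        if frozenset((a, b)) not in passages:
            return False
    return True


def pixel_accuracy(
    pred: Sequence[Sequence[int]],
    target: Sequence[Sequence[int]],
) -> float:
    """Pixel-wise check: fraction of positions where two equally sized
    images (nested sequences of pixel values) agree."""
    total = 0
    same = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            total += 1
            same += int(p == t)
    return same / total if total else 0.0


# Assumed toy 2x2 maze: passages (0,0)-(0,1) and (0,1)-(1,1) are open.
demo_passages = {
    frozenset(((0, 0), (0, 1))),
    frozenset(((0, 1), (1, 1))),
}
good_path = [(0, 0), (0, 1), (1, 1)]
bad_path = [(0, 0), (1, 0), (1, 1)]  # crosses a wall between (0,0) and (1,0)
```

The two checks are kept separate because they can disagree: an output image may match the reference closely pixel by pixel while its path crosses a wall, and a topologically valid path may differ from the reference rendering.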
