ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
Abstract
Multimodal reasoning is a dynamic process that requires the synergistic coordination of language and vision. However, current approaches to multimodal interleaved generation fall short of a generalizable recipe that productively engages text and vision to advance reasoning. We introduce ThinkMorph, a unified thinking model capable of effective interleaved reasoning. By constructing a high-quality pipeline for generating interleaved reasoning data to train unified models, we enable ThinkMorph to produce multimodal reasoning traces in which language and vision mutually advance the reasoning. ThinkMorph delivers substantial gains on vision-centric reasoning, including +11.53\% on visual search and +38.75\% on jigsaw assembly over the base model. It also reaches 80.33\% on MMVP and 52.67\% on SAT, indicating strong generalization. These improvements are large enough to close the gap with, and in some cases surpass, leading large-scale or proprietary VLMs. Moreover, ThinkMorph exhibits \emph{emergent properties} indicative of higher-level multimodal intelligence. These include visual manipulation skills unseen during finetuning, such as zooming in and image inpainting, as well as autonomous reasoning-mode switching, in which the model, despite being trained exclusively on interleaved data, chooses to reason in text alone when the nature of the task calls for it. We show that this ability to think in text, vision, and multimodality opens new avenues for test-time scaling, allowing ThinkMorph to \textit{effectively scale and aggregate thoughts across three reasoning modes}. These findings suggest promising directions for future work characterizing the emergent capabilities of unified models for multimodal reasoning.