Poster Sat, Apr 25, 2026 • 6:30 AM – 9:00 AM PDT

VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use

Mingyuan Wu ⋅ Jingcheng Yang ⋅ Jize Jiang ⋅ Meitang Li ⋅ Kaizhuo Yan ⋅ Hanchao Yu ⋅ Minjia Zhang ⋅ ChengXiang Zhai ⋅ Klara Nahrstedt

Project Page [ OpenReview]

Abstract

Reinforcement learning finetuning (RFT) has significantly advanced the reasoning capabilities of large language models (LLMs) by enabling long chains of thought, multi-turn self-correction, and effective tool use. While recent works attempt to extend RFT to vision-language models (VLMs), these efforts largely focus on text-only reasoning conditioned on original image inputs, and do not incorporate visual reasoning in the response. In contrast, test-time methods like Visual Sketchpad incorporate visual steps but lack training mechanisms. We introduce VTool-R1, the first RFT framework that trains VLMs to generate multimodal chains of thought by interleaving text and intermediate visual reasoning steps. VTool-R1 integrates Python-based visual editing tools into the RFT process, enabling VLMs to learn when and how to generate visual reasoning steps that enhance the final output quality. Trained with outcome-based rewards, our approach elicits strategic visual tool use for multi-modal reasoning without relying on process-based supervision. Extensive experiments on structured visual reasoning over charts and tables show that VTool-R1 enhances reasoning performance by teaching VLMs to "think with images" and generate multimodal chain of thoughts with tools. To support future research in multi-turn multi-modal reasoning, we open-source our code at https://github.com/VTOOL-R1/vtool-r1.

Video

Chat is not available.