Medical thinking with multiple images
Zonghai Yao ⋅ Benlu Wang ⋅ Yifan Zhang ⋅ Junda Wang ⋅ Iris Xia ⋅ Zhipeng Tang ⋅ Shuo Han ⋅ Feiyun ⋅ Zhichao Yang ⋅ Arman Cohan ⋅ Hong Yu
Abstract
Large language models perform well on many medical QA benchmarks, but real clinical reasoning is harder: diagnosis often requires integrating evidence across multiple images rather than interpreting a single view. We introduce MedThinkVQA, an expert-annotated benchmark for thinking with multiple images, in which models must interpret each image, combine cross-view evidence, and solve diagnostic questions under intermediate supervision and step-level evaluation. The dataset contains 10,067 cases, including 720 test cases, with an average of 6.68 images per case, substantially denser than prior work (earlier maxima $\leq$ 1.43 images per case). On the test set, the best closed-source models, Claude-4.6-opus, Gemini-3-pro, and GPT-5.2-xhigh, achieve only 54.9%--57.2% accuracy, while smaller proprietary variants, GPT-5-mini/nano, drop to 39.7% and 30.8%. Top open-source models perform worse overall, with Qwen3.5-397B-A17B (52.2%) and Qwen3.5-27B (50.6%) leading, followed by Lingshu-32B (43.2%), InternVL3.5-38B (40.7%), and MedGemma-27B (31.8%). Further analysis points to a single core bottleneck: current models struggle with grounded multi-image reasoning, i.e., reliably extracting, aligning, and composing evidence across views before higher-level inference can help. Three consistent findings support this: (i) adding expert-provided single-image cues or integrated cross-image evidence improves performance, whereas replacing them with models' self-generated intermediates reduces accuracy; (ii) step-level analysis shows that over 70% of errors stem from image reading and cross-view integration, with reasoning failures increasing on decisive steps; (iii) scaling results show that while accuracy increases with more images, additional inference-time computation is beneficial only when the underlying visual grounding is already reliable. When early evidence extraction is weak, longer reasoning yields limited or unstable gains and can even amplify misread cues.
Together, these results show that the main barrier is not simply insufficient reasoning length or depth, but the lack of reliable mechanisms for grounding, aligning, and composing distributed evidence across real-world, cross-view, multimodal clinical inputs.