The Continuous Space Gap: Why VLMs Fail in Continuous Geometric Reasoning
Abstract
This paper presents a surprising negative result: despite their success in discrete reasoning, Vision–Language Models (VLMs) fail catastrophically in continuous geometric reasoning, achieving only 0.41 IoU on single-piece tasks and 0.23 on two-piece composition, far below human performance. Humans can complete tangram tasks even in childhood, demonstrating significantly high continuous spatial reasoning ability (Bohning & Althouse, 1997). Comprehensive experiments across state-of-the-art VLMs (GPT-4o, Gemini, Claude, Qwen, LLaMA) show that while test-time self-improvement through reward-guided refinement loops does improve predictions (0.63→0.93 IoU on single-piece cases), this refinement is far from sufficient to close the gap: even after self-improvement, performance remains below human level, gains do not reliably generalize, and multi-piece tasks would face even greater challenges even with refinement. Thus, our negative result targets VLMs’ continuous-space reasoning ability, not the existence of test-time refinement itself. We posit five underlying limitations, training distribution mismatch, output format constraints treating coordinates as text strings, visual encoder geometric invariance, positional embedding precision limits, and absence of geometry-aware feedback and inductive biases and document boundary conditions where refinement helps (single-piece tasks) but saturates within 6 iterations, indicating systematic rather than correctable errors