Poster in Workshop: I Can't Believe It's Not Better: Where Large Language Models need to improve

The Continuous Space Gap: Why VLMs Fail in Continuous Geometric Reasoning

Yikun Zong ⋅ Cheston Tan


Abstract

This paper presents a surprising negative result: despite their success in discrete reasoning, Vision–Language Models (VLMs) fail catastrophically at continuous geometric reasoning, achieving only 0.41 IoU on single-piece tasks and 0.23 on two-piece composition, far below human performance. Humans can complete tangram tasks even in childhood, demonstrating strong continuous spatial reasoning ability (Bohning & Althouse, 1997). Comprehensive experiments across state-of-the-art VLMs (GPT-4o, Gemini, Claude, Qwen, LLaMA) show that test-time self-improvement through reward-guided refinement loops does improve predictions (0.63→0.93 IoU on single-piece cases), but this refinement is far from sufficient to close the gap: even after self-improvement, performance remains below human level, gains do not reliably generalize, and multi-piece tasks would face greater challenges still, even with refinement. Thus, our negative result targets VLMs’ continuous-space reasoning ability, not the existence of test-time refinement itself. We posit five underlying limitations: (1) training distribution mismatch, (2) output format constraints that treat coordinates as text strings, (3) visual encoder geometric invariance, (4) positional embedding precision limits, and (5) the absence of geometry-aware feedback and inductive biases. We also document boundary conditions: refinement helps on single-piece tasks but saturates within 6 iterations, indicating systematic rather than correctable errors.
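For readers unfamiliar with the metric, the sketch below illustrates how an intersection-over-union (IoU) score between a predicted and a target piece placement could be computed. It is an illustrative assumption, not the authors' evaluation code: the function names `point_in_polygon` and `raster_iou`, the unit-square canvas, and the grid-sampling approach are all hypothetical choices made for this example.

```python
def point_in_polygon(x, y, poly):
    """Ray-casting test: True if (x, y) lies inside the polygon.

    `poly` is a list of (x, y) vertices in order (hypothetical format).
    """
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def raster_iou(poly_a, poly_b, bounds=(0.0, 0.0, 1.0, 1.0), resolution=200):
    """Approximate IoU of two polygons by sampling grid-cell centers.

    Assumes both polygons lie within `bounds` = (xmin, ymin, xmax, ymax).
    """
    xmin, ymin, xmax, ymax = bounds
    inter = 0
    union = 0
    for i in range(resolution):
        y = ymin + (i + 0.5) * (ymax - ymin) / resolution
        for j in range(resolution):
            x = xmin + (j + 0.5) * (xmax - xmin) / resolution
            in_a = point_in_polygon(x, y, poly_a)
            in_b = point_in_polygon(x, y, poly_b)
            inter += in_a and in_b
            union += in_a or in_b
    return inter / union if union else 0.0


# Target piece: a square; prediction: the same square shifted right by half its width.
target = [(0.0, 0.0), (0.5, 0.0), (0.5, 0.5), (0.0, 0.5)]
predicted = [(0.25, 0.0), (0.75, 0.0), (0.75, 0.5), (0.25, 0.5)]
iou = raster_iou(predicted, target)
```

In this toy case the two squares overlap on half of each square's area, so the IoU is one third: a placement that looks roughly right can still score low, which is why small coordinate errors are costly under this metric.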
