TangramSR: A Benchmark for Recursive Self-Improvement in Continuous Geometric Reasoning
Abstract
Vision–Language Models (VLMs) have achieved remarkable success on discrete multimodal benchmarks, yet struggle with continuous geometric reasoning tasks that require precise spatial alignment. This paper addresses a fundamental challenge in self-improving AI: how can models iteratively refine their predictions at test time without parameter updates? We introduce a test-time self-refinement framework that combines in-context learning with reward-guided feedback loops, enabling VLMs to improve geometric alignment through iterative corrections. Our approach operates on Tangram puzzle assembly, a mathematically rigorous, NP-hard shape arrangement task requiring precise estimation of position, rotation, and scale. We establish a continuous-space evaluation benchmark that decomposes geometric reasoning into factorized subtasks (position, angle, size) and measures performance using ℓ2 distance and polygonal intersection-over-union (IoU). Comprehensive experiments across five representative VLMs reveal systematic performance gaps (average IoU of 0.41 on single-piece tasks, dropping to 0.23 on two-piece composition). Our training-free verifier–refiner agent applies recursive refinement loops that correct predictions based on geometric consistency feedback. Starting from initial predictions with low IoU (0.63), the recursive loop progressively improves geometric alignment over multiple iterations, achieving an IoU of 0.932 on medium-triangle cases without any model retraining. This demonstrates that recursive self-improvement can substantially enhance geometric reasoning in VLMs, moving self-improving AI from promise to practice in continuous spatial domains.
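To make the two evaluation metrics concrete, the following is a minimal sketch of how ℓ2 distance and polygonal IoU could be computed for a pair of convex pieces. It is an illustration under stated assumptions, not the benchmark's actual implementation: the function names (`l2_distance`, `polygon_area`, `clip_convex`, `polygon_iou`) are hypothetical, pieces are assumed to be convex polygons given as lists of (x, y) vertices, and the intersection is obtained with Sutherland–Hodgman clipping, which is valid only when the clipping polygon is convex.

```python
import math


def l2_distance(p, q):
    """Euclidean (l2) distance between two 2D points, e.g. piece centroids."""
    return math.dist(p, q)


def polygon_area(pts):
    """Signed area via the shoelace formula (positive if counter-clockwise)."""
    total = 0.0
    n = len(pts)
    for i in range(n):
        x1, y1 = pts[i]
        x2, y2 = pts[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return total / 2.0


def _as_ccw(pts):
    """Return the vertices in counter-clockwise order."""
    return list(pts) if polygon_area(pts) >= 0 else list(reversed(pts))


def _side(a, b, p):
    """Cross product: > 0 if p is left of the directed edge a -> b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def clip_convex(subject, clipper):
    """Sutherland-Hodgman clipping of `subject` by a convex `clipper`."""
    output = _as_ccw(subject)
    clipper = _as_ccw(clipper)
    for i in range(len(clipper)):
        if not output:
            break
        a, b = clipper[i], clipper[(i + 1) % len(clipper)]
        current, output = output, []
        for j in range(len(current)):
            p, q = current[j], current[(j + 1) % len(current)]
            sp, sq = _side(a, b, p), _side(a, b, q)
            if sp >= 0:  # p lies inside (or on) this clip edge
                output.append(p)
            if (sp > 0 and sq < 0) or (sp < 0 and sq > 0):
                # Edge p -> q crosses the clip line: add the crossing point.
                t = sp / (sp - sq)
                output.append((p[0] + t * (q[0] - p[0]),
                               p[1] + t * (q[1] - p[1])))
    return output


def polygon_iou(pred, target):
    """Intersection-over-union of two convex polygons (assumption: convex)."""
    clipped = clip_convex(pred, target)
    inter = abs(polygon_area(clipped)) if len(clipped) >= 3 else 0.0
    union = abs(polygon_area(pred)) + abs(polygon_area(target)) - inter
    return inter / union if union > 0 else 0.0


# Toy example: a unit square and the same square shifted by 0.5 along x.
target = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
pred = [(0.5, 0.0), (1.5, 0.0), (1.5, 1.0), (0.5, 1.0)]
iou = polygon_iou(pred, target)
offset = l2_distance((0.5, 0.5), (1.0, 0.5))
```

In this toy case the overlap is half a unit square and the union is 1.5 square units, so the IoU is 1/3, while the centroid offset is 0.5. A non-convex target would require a general polygon-clipping routine instead of the convex clipper assumed here.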