Poster in Workshop: ICLR 2026 Workshop on AI with Recursive Self-Improvement

TangramSR: A Benchmark for Recursive Self-Improvement In Continuous Geometric Reasoning

Yikun Zong ⋅ Cheston Tan


Abstract

Vision–Language Models (VLMs) have achieved remarkable success on discrete multimodal benchmarks, yet struggle with continuous geometric reasoning tasks that require precise spatial alignment. This paper addresses a fundamental challenge in self-improving AI: how can models iteratively refine their predictions at test time without parameter updates? We introduce a test-time self-refinement framework that combines in-context learning with reward-guided feedback loops, enabling VLMs to improve geometric alignment through iterative corrections. Our approach operates on Tangram puzzle assembly, a mathematically rigorous, NP-hard shape arrangement task requiring precise estimation of position, rotation, and scale. We establish a continuous-space evaluation benchmark that decomposes geometric reasoning into factorized subtasks (position, angle, size) and measures performance using ℓ2 distance and polygonal intersection-over-union (IoU). Comprehensive experiments across five representative VLMs reveal systematic performance gaps (average IoU of 0.41 on single-piece tasks, dropping to 0.23 on two-piece composition). Our training-free verifier–refiner agent applies recursive loops that refine predictions based on geometric consistency feedback. Starting from initial predictions with low IoU (0.63), the recursive loop progressively improves geometric alignment over multiple iterations, achieving an IoU of 0.932 on medium-triangle cases without any model retraining. This demonstrates that recursive self-improvement can substantially enhance geometric reasoning in VLMs, moving self-improving AI from promise to practice in continuous spatial domains.
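To make the two metrics and the refinement idea concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes each piece is a convex polygon placed by a hypothetical pose `(x, y, angle_deg, scale)`, computes polygonal IoU with Sutherland-Hodgman clipping, and replaces the VLM refiner with a simple greedy numeric search that only accepts changes the IoU "verifier" scores higher. All function names are illustrative.

```python
from math import cos, sin, radians, hypot

def shoelace_area(poly):
    """Signed area of [(x, y), ...]; positive when counter-clockwise."""
    n = len(poly)
    return 0.5 * sum(poly[i][0] * poly[(i + 1) % n][1]
                     - poly[(i + 1) % n][0] * poly[i][1] for i in range(n))

def clip_convex(subject, clipper):
    """Sutherland-Hodgman: clip `subject` by a convex, counter-clockwise `clipper`."""
    output = list(subject)
    n = len(clipper)
    for i in range(n):
        if not output:
            break
        a, b = clipper[i], clipper[(i + 1) % n]
        side = lambda p: (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
        inp, output = output, []
        for j in range(len(inp)):
            p, q = inp[j], inp[(j + 1) % len(inp)]
            sp, sq = side(p), side(q)
            if sp >= 0:                      # p lies inside this half-plane
                output.append(p)
            if (sp > 0 > sq) or (sp < 0 < sq):  # edge p->q crosses the boundary
                t = sp / (sp - sq)
                output.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
    return output

def polygon_iou(p, q):
    """IoU of two convex polygons."""
    if shoelace_area(q) < 0:                 # clipper must be counter-clockwise
        q = q[::-1]
    inter = clip_convex(p, q)
    inter_area = abs(shoelace_area(inter)) if len(inter) >= 3 else 0.0
    union = abs(shoelace_area(p)) + abs(shoelace_area(q)) - inter_area
    return inter_area / union if union > 0 else 0.0

def place(template, x, y, angle_deg, scale):
    """Scale, rotate, then translate a template polygon."""
    c, s = cos(radians(angle_deg)), sin(radians(angle_deg))
    return [(x + scale * (c * px - s * py), y + scale * (s * px + c * py))
            for px, py in template]

def position_error(pose_a, pose_b):
    """l2 distance between the (x, y) components of two poses."""
    return hypot(pose_a[0] - pose_b[0], pose_a[1] - pose_b[1])

def refine(template, target, pose, steps=(0.2, 5.0, 0.1), iters=50):
    """Greedy stand-in for the refiner (assumption): keep a tweak only if IoU rises."""
    best = polygon_iou(place(template, *pose), target)
    for _ in range(iters):
        improved = False
        for k in range(4):                   # x, y, angle, scale
            step = (steps[0], steps[0], steps[1], steps[2])[k]
            for delta in (step, -step):
                cand = list(pose)
                cand[k] += delta
                if cand[3] <= 0:             # scale must stay positive
                    continue
                score = polygon_iou(place(template, *cand), target)
                if score > best:
                    pose, best, improved = tuple(cand), score, True
        if not improved:                     # shrink step sizes when stuck
            steps = tuple(s / 2 for s in steps)
    return pose, best

# Usage: a right triangle as a stand-in for a Tangram piece.
triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
true_pose = (0.3, 0.2, 30.0, 1.2)
target = place(triangle, *true_pose)
start = (0.0, 0.0, 0.0, 1.0)
start_iou = polygon_iou(place(triangle, *start), target)
final_pose, final_iou = refine(triangle, target, start)
print(f"IoU {start_iou:.3f} -> {final_iou:.3f}, "
      f"position error {position_error(final_pose, true_pose):.3f}")
```

Because a candidate pose is kept only when the IoU increases, the score in this toy loop can never get worse across iterations. In the setting described above the target geometry would not be handed to the model directly, so treat the numeric search purely as an illustration of a verify-then-refine cycle.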
