

Poster in the Workshop on Reasoning and Planning for Large Language Models

Think to Ground: Improving Spatial Reasoning in LLMs for better Visual Grounding

Karun Sharma · Vidushee Vats


Abstract:

Visual grounding tasks involve identifying objects and references in an image based on a text query. A model must locate the objects, understand their relationships, and interpret the image to accurately ground the target. Specialized models like Owl-ViT and Grounding DINO often fail to predict correct results for queries involving complex spatial information. In this paper, we propose a Spatial Thinking and Reasoning Dataset for visual grounding, together with a framework that uses existing detection models to identify candidate objects. These models provide coordinates and other attributes to a large language model (LLM), which performs spatial reasoning to determine the correct target. Recent closed-source models like GPT-4o achieve approximately 86% accuracy, while open-source models perform significantly worse, reaching only about 58% accuracy in our experiments. To close this gap, we use reinforcement learning to fine-tune a 3B model on our dataset, achieving 77% accuracy, comparable to closed-source models.
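The abstract describes the framework only at a high level. As a rough illustration of the idea, the sketch below shows how candidate detections (labels and bounding boxes) might be serialized into a prompt for LLM-based spatial reasoning; the `Candidate` data class, `build_prompt` helper, and hard-coded detections are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch, assuming a generic detector-to-LLM pipeline.
# Names here are hypothetical placeholders, not the authors' code.
from dataclasses import dataclass

@dataclass
class Candidate:
    label: str
    box: tuple[int, int, int, int]  # (x1, y1, x2, y2) pixel coordinates

def build_prompt(query: str, candidates: list[Candidate]) -> str:
    """Serialize detector outputs so an LLM can reason over spatial relations."""
    lines = [f"Query: {query}", "Candidate objects:"]
    for i, c in enumerate(candidates):
        lines.append(f"  {i}: label={c.label}, box={c.box}")
    lines.append("Reason about the spatial relations, then reply with the "
                 "index of the candidate that matches the query.")
    return "\n".join(lines)

# Example usage with hard-coded detections; in practice these would come from
# a detector such as OWL-ViT or Grounding DINO, and the prompt would be sent
# to an LLM whose answer selects the final box.
candidates = [
    Candidate("cup", (40, 220, 90, 280)),
    Candidate("cup", (310, 215, 365, 278)),
]
print(build_prompt("the cup to the left of the laptop", candidates))
```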
