Where’s the Chicken? Unpacking Spatial Awareness in Vision-Language Models
Jiyoon Pyo · Yao-Yi Chiang
Abstract
Modern vision-language models (VLMs) have achieved impressive success in recognizing and describing visual content, yet they continue to struggle with understanding spatial relationships. This limitation persists despite massive data and model scaling, suggesting that the problem is rooted in architecture and training objectives rather than data alone. This post examines the underlying causes and discusses why recently proposed fixes, while promising, remain insufficient for robust spatial reasoning.