Where’s the Chicken? Unpacking Spatial Awareness in Vision-Language Models
Jiyoon Pyo · Yao-Yi Chiang
Abstract
Modern vision-language models (VLMs) have achieved impressive success in recognizing and describing visual content, yet they continue to struggle with understanding spatial relationships. This limitation persists despite massive data and model scaling, suggesting that the problem is rooted in architecture and training objectives rather than data alone. This post examines the underlying causes and discusses why recently proposed fixes, while promising, remain insufficient for robust spatial reasoning.