
Affinity Workshop: Tiny Papers Oral Session 3

Dissecting Zero-Shot Visual Reasoning Capabilities in Vision and Language Models

Aishik Nagar · Shantanu Jaiswal · Cheston Tan


Vision-language models (VLMs) have shown impressive zero- and few-shot performance on real-world visual question answering (VQA) benchmarks, alluding to their capabilities as visual reasoning engines. However, existing works typically use benchmarks that conflate "pure" visual reasoning with world knowledge, and whose questions involve only a limited number of reasoning steps. Thus, it remains unclear whether a VLM's apparent visual reasoning performance is due to its world knowledge or to actual visual reasoning capabilities. To clarify this ambiguity, we systematically benchmark and dissect the zero-shot visual reasoning capabilities of VLMs through synthetic datasets that require minimal world knowledge and allow for analysis over a broad range of reasoning steps. We specifically focus on evaluating the impact of conveying scene information to the underlying large language model (LLM) of the VLM as either visual embeddings or purely textual scene descriptions. We notably find that the underlying LLMs consistently perform significantly better when provided textual scene descriptions than when provided visual embeddings. Our work comprehensively identifies limitations of VLMs for compositional visual reasoning, and highlights the important role that LLMs can play in scene understanding and visual reasoning.
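The two input conditions compared in the abstract — textual scene descriptions versus visual embeddings fed to the LLM — can be illustrated with a minimal sketch. Everything below is a hypothetical illustration, not the authors' code: the scene format, the `describe_scene` and `build_text_prompt` helpers, and the prompt template are all assumptions made for exposition.

```python
# Hypothetical sketch of the "textual scene description" condition:
# a synthetic scene (requiring minimal world knowledge) is rendered as
# text and combined with a question into a zero-shot prompt for an LLM.
# The alternative condition would instead pass visual embeddings of a
# rendered image to the VLM; that path is not shown here.

def describe_scene(objects):
    """Render a list of synthetic objects as a plain-text scene description."""
    lines = [
        f"A {o['size']} {o['color']} {o['shape']} at position {o['position']}."
        for o in objects
    ]
    return " ".join(lines)

def build_text_prompt(objects, question):
    """Build a zero-shot VQA prompt from a textual scene description."""
    return (
        f"Scene: {describe_scene(objects)}\n"
        f"Question: {question}\n"
        f"Answer:"
    )

# Example synthetic scene with two attribute-bearing objects.
scene = [
    {"size": "large", "color": "red", "shape": "cube", "position": "(1, 2)"},
    {"size": "small", "color": "blue", "shape": "sphere", "position": "(3, 0)"},
]
prompt = build_text_prompt(scene, "Is the red cube left of the blue sphere?")
```

Chaining more objects and multi-hop questions over their attributes is one way such datasets can scale the number of reasoning steps while keeping world-knowledge demands minimal.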
