

Poster

Causal Graphical Models for Vision-Language Compositional Understanding

Fiorenzo Parascandolo · Nicholas Moratelli · Enver Sangineto · Lorenzo Baraldi · Rita Cucchiara

Hall 3 + Hall 2B #422
[ Project Page ]
Fri 25 Apr 7 p.m. PDT — 9:30 p.m. PDT

Abstract:

Recent work has empirically shown that Vision-Language Models (VLMs) struggle to fully understand the compositional properties of human language, usually modeling an image caption as a “bag of words”. As a result, they perform poorly on compositional tasks, which require a deeper understanding of the different entities of a sentence (subject, verb, etc.) jointly with their mutual relationships in order to be solved. In this paper, we model the dependency relations among textual and visual tokens using a Causal Graphical Model (CGM), built using a dependency parser, and we train a decoder conditioned on the VLM visual encoder. Unlike standard autoregressive or parallel predictions, our decoder’s generative process is partially ordered, following the CGM structure. This structure encourages the decoder to learn only the main causal dependencies in a sentence, discarding spurious correlations. Through extensive experiments on five compositional benchmarks, we show that our method outperforms all state-of-the-art compositional approaches by a large margin, and it also improves over methods trained on much larger datasets. Our model weights and code are publicly available.
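To make the partially-ordered generation concrete, the sketch below shows one way a dependency parser can induce the ordering constraints the abstract describes: each token may be predicted only after its ancestors in the dependency tree, rather than after every preceding token as in a left-to-right decoder. This is a minimal illustration, assuming spaCy as the parser; the helper build_cgm_order is hypothetical and does not reproduce the authors' actual CGM construction or decoder.

```python
import spacy  # off-the-shelf dependency parser; stands in for the parser used in the paper

nlp = spacy.load("en_core_web_sm")

def build_cgm_order(caption: str):
    """Parse a caption and return, for each token index, the set of its
    ancestors in the dependency tree. In a partially-ordered generative
    process, a token is conditioned only on these causal parents,
    discarding spurious left-to-right correlations."""
    doc = nlp(caption)
    ancestors = {token.i: {a.i for a in token.ancestors} for token in doc}
    return doc, ancestors

# Example: inspect which tokens must precede each token during generation.
doc, ancestors = build_cgm_order("A brown dog chases a white cat")
for token in doc:
    parents = [doc[i].text for i in sorted(ancestors[token.i])]
    print(f"{token.text:>8} <- {parents}")
```

Run on the example caption, the root verb has no parents and can be generated first, while modifiers such as “brown” depend on their head noun; any topological order of this graph is a valid generation order under the partial ordering.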
