Skip to yearly menu bar Skip to main content


Poster
in
Workshop: Multimodal Representation Learning (MRL): Perks and Pitfalls

Towards understanding the modality gap in CLIP

Peiyang Shi · Michael Welle · Mårten Björkman · Danica Kragic

Keywords: [ Multimodality ] [ contrastive learning ] [ CLIP ] [ modality gap ] [ NT-Xent ] [ SLIP ]


Abstract:

This work examines the phenomenon of the modality gap observed in CLIP-based multimodal learning methods. The modality gap in this context refers to the separation of image and text embeddings in the joint latent space. Some previous research has attributed the gap to cone effect of neural network initialization and suggested closing may not be necessary. However, this study argues that the modality gap is associated with local minima in the CLIP loss function. Through a series of proof-of-concept experiments, we illustrate these local minima and the difficulty of avoiding them in practice. Overall, this work hopes to provide better insight into the root cause of the modality gap.

Chat is not available.