Poster in Workshop: Bridging the Gap Between Practice and Theory in Deep Learning
Understanding multimodal contrastive learning through pointwise mutual information
Toshimitsu Uesaka · Taiji Suzuki · Yuhta Takida · Chieh-Hsin Lai · Naoki Murata · Yuki Mitsufuji
Multimodal representation learning that integrates different modalities, such as text, vision, and audio, is important for real-world applications. The symmetric InfoNCE loss proposed in CLIP is one of the key concepts in multimodal representation learning. In this work, we provide a theoretical understanding of the symmetric InfoNCE loss through the lens of pointwise mutual information and show that encoders achieving the optimal similarity during pretraining provide good representations for downstream classification tasks under mild assumptions. Based on our theoretical results, we also propose a new similarity metric for multimodal contrastive learning that utilizes a nonlinear kernel to enrich its expressive capability. To verify the effectiveness of the proposed method, we pretrain multimodal representation models on the Conceptual Captions datasets and evaluate zero-shot classification and linear classification on common benchmark datasets.
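For readers unfamiliar with the loss the abstract refers to, below is a minimal PyTorch sketch of the symmetric (CLIP-style) InfoNCE loss with a standard cosine-similarity critic. The function name, tensor names, and temperature value are illustrative assumptions rather than details from the paper; the paper's proposed nonlinear-kernel similarity would presumably replace the plain inner-product similarity computed in `logits`.

```python
# Minimal sketch (not the authors' implementation) of the symmetric InfoNCE
# loss used in CLIP-style multimodal contrastive pretraining.
import torch
import torch.nn.functional as F

def symmetric_infonce(image_emb: torch.Tensor,
                      text_emb: torch.Tensor,
                      temperature: float = 0.07) -> torch.Tensor:
    """Average of the image-to-text and text-to-image InfoNCE losses.

    image_emb, text_emb: (batch, dim) embeddings of paired samples;
    row i of each tensor comes from the same image-caption pair.
    """
    # Cosine-similarity logits between every image and every text in the batch.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # shape: (batch, batch)

    # The matched pair for row/column i sits on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    loss_i2t = F.cross_entropy(logits, targets)      # images retrieve texts
    loss_t2i = F.cross_entropy(logits.t(), targets)  # texts retrieve images
    return 0.5 * (loss_i2t + loss_t2i)
```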