Poster
Weighted Point Set Embedding for Multimodal Contrastive Learning Toward Optimal Similarity Metric
Toshimitsu Uesaka · Taiji Suzuki · Yuhta Takida · Chieh-Hsin Lai · Naoki Murata · Yuki Mitsufuji
Hall 3 + Hall 2B #332
In typical multimodal contrastive learning, such as CLIP, encoders produce one point in the latent representation space for each input. However, one-point representations have difficulty capturing the relationships and the similarity structure of the huge number of instances in the real world. To support richer classes of similarity, we propose the use of weighted point sets, namely, sets of weight-vector pairs, as representations of instances. In this work, we theoretically show the benefit of our proposed method through a new understanding of the contrastive loss of CLIP, which we call symmetric InfoNCE. We clarify that the optimal similarity that minimizes symmetric InfoNCE is the pointwise mutual information, and show an upper bound on the excess risk of downstream classification tasks for representations that achieve the optimal similarity. In addition, we show that our proposed similarity based on weighted point sets consistently achieves the optimal similarity. To verify the effectiveness of our proposed method, we demonstrate pretraining of text-image representation models and classification tasks on common benchmarks.
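As a rough illustration of the objective described above, the following NumPy sketch computes a symmetric InfoNCE loss over a batch of paired instances, using one plausible weighted-point-set similarity: a log-sum-exp over weighted pairwise inner products. The function names, the temperature tau, the uniform weights, and the exact form of the set similarity are assumptions for illustration only, not the paper's definitions.

import numpy as np

def weighted_set_similarity(wx, X, wy, Y, tau=0.07):
    # Similarity between two weighted point sets (wx, X) and (wy, Y).
    # X: (m, d) vectors with nonnegative weights wx: (m,);
    # Y: (n, d) vectors with nonnegative weights wy: (n,).
    # Assumed form (illustrative, not the paper's exact definition):
    # log of the weighted sum of exp(<x_i, y_j> / tau) over all cross-set pairs.
    pairwise = X @ Y.T / tau            # (m, n) scaled inner products
    weights = np.outer(wx, wy)          # (m, n) pairwise weight products
    return np.log((weights * np.exp(pairwise)).sum())

def symmetric_infonce(sim):
    # Symmetric (CLIP-style) InfoNCE from an (N, N) similarity matrix,
    # where sim[i, j] scores instance i of one modality against instance j
    # of the other, and the diagonal holds the matched pairs.
    def cross_entropy(logits):
        logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
        log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))
    return 0.5 * (cross_entropy(sim) + cross_entropy(sim.T))

# Toy batch: N paired instances, each represented by a set of m points in d dims.
rng = np.random.default_rng(0)
N, m, d = 4, 3, 8
img_sets = rng.normal(size=(N, m, d))
txt_sets = rng.normal(size=(N, m, d))
img_w = np.full((N, m), 1.0 / m)        # uniform weights summing to 1 (an assumption)
txt_w = np.full((N, m), 1.0 / m)

sim = np.array([[weighted_set_similarity(img_w[i], img_sets[i],
                                         txt_w[j], txt_sets[j])
                 for j in range(N)] for i in range(N)])
print("symmetric InfoNCE:", symmetric_infonce(sim))

Here the one-point representation of CLIP corresponds to the special case m = 1 with unit weight, in which the set similarity reduces to a single scaled inner product.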