Skip to yearly menu bar Skip to main content


Poster
in
Workshop: Multimodal Representation Learning (MRL): Perks and Pitfalls

Dynamic Pretraining of Vision-Language Models

AJ Piergiovanni · Weicheng Kuo · Wei Li · Anelia Angelova

Keywords: [ pretraining ] [ sampling ] [ vision language ] [ curriculum learning ]


Abstract:

Vision-Language pretraining aims to learn universal cross-modal representations and to create models with broad capabilities. While most models have taken the direction of scaling training to increasingly large models and datasets, in this paper, we propose a dynamic pretraining resampling approach which utilizes a variety of pretraining tasks, and which results in more sample-efficient models. We show that a set of diverse self- and weakly-supervised pretraining tasks dynamically sampled according to task difficulty provides strong performance. We show that a single 330M param pretrained model using only smaller and publicly accessible image-language datasets, achieves competitive or SOTA performance on three diverse groups of tasks: visual question answering, text-based image localization by referring expressions, and video question answering.

Chat is not available.