Moderators: Bohyung Han · Cordelia Schmid
In this talk, we present recent progress on large-scale learning of multimodal video representations. We start by presenting VideoBERT, a joint model for video and language that repurposes the BERT model for multimodal data. This model achieves state-of-the-art results on zero-shot prediction and video captioning. Next, we present an approach for video question answering that relies on training from instructional videos and cross-modal supervision with a text-based question-answering module. We show state-of-the-art results for video question answering without any supervision (zero-shot VQA) and demonstrate that our approach obtains competitive results when pre-training and then fine-tuning on video question answering datasets. We conclude the talk by presenting the recent VideoCC dataset, which transfers image captions to video and enables state-of-the-art performance for zero-shot video and audio retrieval as well as video captioning.