Invited Talk in Workshop: Multimodal Representation Learning (MRL): Perks and Pitfalls

Injecting large models with new modalities for Video Understanding

Arsha Nagrani


Abstract:

Large models have recently had an "explosion" moment, achieving state-of-the-art results across a wide range of benchmarks and tasks. Here we discuss how they can be adapted to novel vision and audio inputs for multimodal tasks, either by influencing model design or by serving as frozen components in multimodal architectures. We focus on multimodal video captioning tasks such as ASR and automatic audio description (AD) for movies, and cover several papers recently accepted at CVPR 2023.