Invited Talk in Workshop: Multimodal Representation Learning (MRL): Perks and Pitfalls

Learning Visual Features Enriched by Audio or Language

Kristen Grauman


Abstract:

Feature learning for multimodal perception has great potential to unlock problems in video understanding, augmented reality, and embodied AI. I will present some of our recent work on learning with audio-visual (AV) and visual-language (VL) modalities. First, we explore how audio’s spatial signals can augment visual understanding of 3D environments, including ideas for self-supervised feature learning from echoes and for AV floorplan reconstruction. Next, building on these spatial AV and scene-acoustics ideas, we introduce new ways to enhance the audio stream: transporting a sound into a new physical environment observed in a photo, or dereverberating speech so it is intelligible to machine and human ears alike. Throughout this line of work, we leverage our open-source SoundSpaces platform, which provides state-of-the-art rendering of highly realistic audio in real-world scanned environments and thereby facilitates self-supervised AV learning. Finally, we propose a hierarchical video-language (VL) embedding that simultaneously learns to account for both the “what” (step-by-step activity) and the “why” (the actor’s intention) in egocentric video.
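For readers unfamiliar with self-supervised AV feature learning in general, the sketch below illustrates one common formulation: a contrastive objective that treats co-occurring video frames and audio clips as positive pairs and all other pairings in the batch as negatives. This is a generic illustration only, not the specific methods or the SoundSpaces API from the talk; all module names, shapes, and hyperparameters are assumptions.

    # Illustrative sketch of generic audio-visual contrastive learning.
    # Not the speaker's method; names and shapes are assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AVContrastive(nn.Module):
        def __init__(self, dim=128):
            super().__init__()
            # Stand-in encoders; real systems would use CNN or
            # transformer backbones for each modality.
            self.visual_enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim))
            self.audio_enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim))

        def forward(self, frames, spectrograms):
            # L2-normalize so similarity is a cosine score.
            v = F.normalize(self.visual_enc(frames), dim=-1)        # (B, dim)
            a = F.normalize(self.audio_enc(spectrograms), dim=-1)   # (B, dim)
            return v, a

    def info_nce(v, a, temperature=0.07):
        # Co-occurring (visual, audio) pairs sit on the diagonal of the
        # similarity matrix; off-diagonal entries are negatives.
        logits = v @ a.t() / temperature                  # (B, B)
        targets = torch.arange(v.size(0), device=v.device)
        # Symmetric loss: match video->audio and audio->video.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    # Toy usage with random tensors standing in for RGB frames and
    # audio spectrograms drawn from the same video clips.
    model = AVContrastive()
    frames = torch.randn(8, 3, 64, 64)       # batch of video frames
    specs = torch.randn(8, 1, 128, 128)      # batch of audio spectrograms
    v, a = model(frames, specs)
    loss = info_nce(v, a)
    loss.backward()

In practice the paired data would come from videos with synchronized audio tracks (or simulated audio rendered in scanned environments), and the learned features would then transfer to downstream tasks such as the spatial-understanding problems described above.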
