Invited Talk in Workshop: Multimodal Representation Learning (MRL): Perks and Pitfalls
Learning Visual Features Enriched by Audio or Language
Kristen Grauman
Feature learning for multimodal perception has great potential to unlock problems in video understanding, augmented reality, and embodied AI. I will present some of our recent work on learning with audio-visual (AV) and visual-language (VL) modalities. First, we explore how audio's spatial signals can augment visual understanding of 3D environments, including ideas for self-supervised feature learning from echoes and AV floorplan reconstruction. Next, building on these spatial AV and scene acoustics ideas, we introduce new ways to enhance the audio stream, making it possible to transport a sound to a new physical environment observed in a photo, or to dereverberate speech so that it is intelligible to machine and human ears alike. Throughout this line of work, we leverage our open-source SoundSpaces platform, which provides state-of-the-art rendering of highly realistic audio in real-world scanned environments and thereby facilitates self-supervised AV learning. Finally, we propose a hierarchical video-language embedding that simultaneously learns to account for both the "what" (the step-by-step activity) and the "why" (the intention of the actor) in egocentric video.