Following the success of deep learning, multimodal machine learning has made steady progress and is now ubiquitous in many domains. Learning representations from multiple modalities can be beneficial: different perceptual modalities can inform each other and ground abstract phenomena in a more robust, generalisable way. However, the complexity of different modalities can hinder the training process, requiring careful model design in order to learn meaningful representations. In light of these seemingly conflicting aspects of multimodal learning, we must improve our understanding of what makes each modality different, how modalities interact, and what the desiderata of multimodal representations are.

With this workshop, we aim to bring the multimodal community together and promote work on multimodal representation learning that provides systematic insights into the nature of the learned representations, as well as ways to improve and understand the training of multimodal models, from both a theoretical and an empirical point of view. In particular, we focus on the following questions:

(Representation) How do we identify useful properties of multimodal representations?
(Training) How can we promote useful properties of multimodal representations?
(Modalities) What makes a modality different? How can we improve their interactions?

The MRL workshop aims to bring together experts from the multimodal learning community to advance these fundamental questions and discuss the future of the field. We invite submissions that present analyses of the properties of multimodal representations, insights on interactions across modalities, and novel applications regarding the nature and number of modalities employed.

Fri 12:00 a.m. - 12:10 a.m. | Introduction and Opening Remarks (Intro)

Fri 12:10 a.m. - 12:40 a.m. | Foundations of Multimodal Machine Learning: Principles, Challenges, and Open Questions (Invited Talk)

Multimodal machine learning has brought unique computational and theoretical challenges to the machine learning community, given the heterogeneity of data sources and the interconnections often found between modalities. However, the breadth of progress in multimodal research has made it difficult to identify the common themes and open questions in the field. By synthesizing a broad range of application domains and theoretical frameworks from both historical and recent perspectives, this talk provides an overview of the computational and theoretical foundations of multimodal machine learning. We start by defining three key principles of modality heterogeneity, connections, and interactions that have driven subsequent innovations, and propose a taxonomy of six core technical challenges (representation, alignment, reasoning, generation, transference, and quantification) covering historical and recent trends. Recent technical achievements will be presented through the lens of this taxonomy, allowing researchers to understand the similarities and differences across new approaches. We end by motivating several open problems for future research identified by our taxonomy.

Paul Pu Liang

Fri 12:45 a.m. - 12:55 a.m. | Analyzing Multimodal Objectives Through the Lens of Generative Diffusion Guidance (Poster)

Recent years have witnessed astonishing advances in the field of multimodal representation learning, with contrastive learning being the cornerstone of major breakthroughs. The latest works have delivered further improvements by incorporating objectives such as masked modeling and captioning into these frameworks, but our understanding of how these objectives facilitate learning remains vastly incomplete. In this paper, we leverage the fact that classifier-guided diffusion models generate images reflecting the semantic signals provided by the classifier to study the characteristics of multimodal learning objectives. Specifically, we compare contrastive, matching, and captioning losses in terms of their semantic signals, and introduce a simple baseline that not only supports our analyses but also improves the quality of generative guidance in a straightforward manner.

Chaerin Kong · Nojun Kwak

Fri 12:55 a.m. - 1:00 a.m. | Q&A
Chaerin Kong · Nojun Kwak

Fri 1:00 a.m. - 1:10 a.m. | Hyperbolic Image-Text Representations (Poster)

Visual and linguistic concepts naturally organize themselves in a hierarchy, where a textual concept such as 'dog' entails all images that contain dogs. Despite this intuition, current large-scale vision and language models such as CLIP do not explicitly capture such a hierarchy. We propose MERU, a contrastive model that yields hyperbolic representations of images and text. Hyperbolic manifolds have suitable geometric properties for embedding tree-like data, so MERU can better capture the underlying hierarchy in image-text datasets. Our results show that MERU learns a highly interpretable and structured representation space while maintaining (or improving) CLIP's performance on standard transfer tasks such as zero-shot classification, retrieval, and resource-constrained deployment.

Karan Desai · Maximilian Nickel · Tanmay Rajpurohit · Justin Johnson · Shanmukha Ramakrishna Vedantam

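For readers unfamiliar with the Lorentz (hyperboloid) model referenced above, the following is a minimal numerical sketch, not MERU's implementation: encoder outputs are treated as tangent vectors at the origin, lifted onto the hyperboloid with the exponential map, and compared by negative geodesic distance. The curvature value, dimensionality, and random embeddings are illustrative assumptions.

import numpy as np

def expmap_origin(v, c=1.0):
    # Lift a Euclidean "space" vector v onto the hyperboloid of curvature -c.
    # Returns (x0, x_space) satisfying -x0^2 + ||x_space||^2 = -1/c.
    vnorm = np.linalg.norm(v, axis=-1, keepdims=True) + 1e-12
    x0 = np.cosh(np.sqrt(c) * vnorm) / np.sqrt(c)
    xs = np.sinh(np.sqrt(c) * vnorm) * v / (np.sqrt(c) * vnorm)
    return np.concatenate([x0, xs], axis=-1)

def lorentz_distance(x, y, c=1.0):
    # Geodesic distance from the Lorentzian inner product <x, y>_L = -x0*y0 + <xs, ys>.
    inner = -x[..., 0] * y[..., 0] + np.sum(x[..., 1:] * y[..., 1:], axis=-1)
    return np.arccosh(np.clip(-c * inner, 1.0, None)) / np.sqrt(c)

# Toy usage: stand-in 4-d image/text embeddings produced by the two encoders.
img_tangent = 0.1 * np.random.randn(3, 4)
txt_tangent = 0.1 * np.random.randn(3, 4)
img_h = expmap_origin(img_tangent)
txt_h = expmap_origin(txt_tangent)
logits = -lorentz_distance(img_h[:, None, :], txt_h[None, :, :])
print(logits.shape)  # (3, 3)

In a CLIP-style contrastive objective, these negative geodesic distances would play the role that cosine-similarity logits play in Euclidean models.
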
Fri 1:11 a.m. - 1:20 a.m. | Coffee break (Q&A)

Fri 1:20 a.m. - 2:00 a.m. | Compositionality and Abstraction in Multimodal Learning (Invited Talk)
Zeynep Akata

Fri 2:00 a.m. - 2:03 a.m. | Interpreting Multimodal Video Transformers Using Brain Recordings (Poster)

Integrating information from multiple modalities is arguably one of the essential prerequisites for grounding artificial intelligence systems in an understanding of the real world. Recent advances in video transformers that jointly learn from vision, text, and sound over time have made some progress toward this goal, but the degree to which these models integrate information from the input modalities remains unclear. In this work, we present a promising approach for probing a multimodal video transformer by leveraging neuroscientific evidence of multimodal information processing in the brain. We use brain recordings of subjects watching a popular TV show to interpret the integration of multiple modalities in a video transformer, before and after it is trained to perform a question-answering task that requires vision and language information. For the early and middle layers, we show that fine-tuning on the vision-language task does not improve alignment, relative to the pre-trained counterparts, in brain regions thought to support the integration of multimodal information. We further show that the top layers of the fine-tuned model align substantially less with the brain representations yet yield better task performance than other layers, which indicates that the task may require information beyond what is available in the brain recordings.

Tianai Dong · Mariya Toneva

Fri 2:03 a.m. - 2:06 a.m. | A Picture is Worth a Thousand Words: Language Models Plan from Pixels (Poster)

Planning is an important capability of artificial agents that perform long-horizon tasks in real-world environments. In this work, we explore the use of pre-trained language models (PLMs) to reason about plan sequences from text instructions in embodied visual environments. Prior PLM-based approaches to planning either assume that observations are available in the form of text produced by a captioning model, reason about plans from the instruction alone, or incorporate information about the visual environment in limited ways (such as through a pre-trained affordance function). In contrast, we show that a PLM can plan accurately even when observations are directly encoded as input prompts. We show that this simple approach outperforms prior methods in experiments on the ALFWorld and VirtualHome benchmarks.

Anthony Z Liu · Lajanugen Logeswaran · Sungryull Sohn · Honglak Lee

Fri 2:06 a.m. - 2:09 a.m. | Dynamic Pretraining of Vision-Language Models (Poster)

Vision-language pretraining aims to learn universal cross-modal representations and to create models with broad capabilities. While most work has moved toward scaling training to increasingly large models and datasets, in this paper we propose a dynamic pretraining resampling approach that draws on a variety of pretraining tasks and results in more sample-efficient models. We show that a diverse set of self- and weakly-supervised pretraining tasks, dynamically sampled according to task difficulty, provides strong performance. A single 330M-parameter pretrained model, using only smaller and publicly accessible image-language datasets, achieves competitive or state-of-the-art performance on three diverse groups of tasks: visual question answering, text-based image localization by referring expressions, and video question answering.

AJ Piergiovanni · Weicheng Kuo · Wei Li · Anelia Angelova

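The abstract above describes dynamically resampling pretraining tasks according to task difficulty. As a generic illustration of that idea (not the authors' exact scheme), one can track an exponential moving average of each task's loss and sample the next task in proportion to a softmax over those averages:

import math
import random

class DifficultySampler:
    """Sample pretraining tasks in proportion to their recent (EMA) loss."""

    def __init__(self, task_names, temperature=1.0, ema_decay=0.9):
        self.ema = {t: 1.0 for t in task_names}   # optimistic initial difficulty
        self.temperature = temperature
        self.ema_decay = ema_decay

    def update(self, task, loss_value):
        # Keep a smoothed estimate of how hard each task currently is.
        self.ema[task] = self.ema_decay * self.ema[task] + (1 - self.ema_decay) * loss_value

    def sample(self):
        tasks = list(self.ema)
        weights = [math.exp(self.ema[t] / self.temperature) for t in tasks]
        return random.choices(tasks, weights=weights, k=1)[0]

sampler = DifficultySampler(["captioning", "image_text_matching", "masked_lm"])
task = sampler.sample()                 # choose the task for the next training step
sampler.update(task, loss_value=2.3)    # feed back the observed training loss
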
Fri 2:09 a.m. - 2:12 a.m. | CHiLS: Zero-shot Image Classification with Hierarchical Label Sets (Poster)

Open-vocabulary models (e.g., CLIP) have shown strong performance on zero-shot classification through their ability to generate embeddings for each class based on its (natural language) name. Prior work has focused on improving the accuracy of these models through prompt engineering or by fine-tuning with a small amount of labeled downstream data. However, there has been little focus on improving the richness of the class names themselves, which can pose issues when class labels are coarsely defined and uninformative. We propose Classification with Hierarchical Label Sets (or CHiLS), an alternative strategy for zero-shot classification specifically designed for datasets with implicit semantic hierarchies. CHiLS proceeds in three steps: (i) for each class, produce a set of subclasses, using either existing hierarchies or by querying GPT-3; (ii) perform the standard zero-shot CLIP procedure as though these subclasses were the labels of interest; (iii) map the predicted subclass back to its parent to produce the final prediction. Across numerous datasets with underlying hierarchical structure, CHiLS improves accuracy both with and without ground-truth hierarchical information.

Zachary Novack · Saurabh Garg · Julian McAuley · Zachary Lipton

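The three steps listed in the abstract map almost directly onto code. Below is a minimal sketch using the OpenAI clip package; the subclass lists and image path are placeholders (the paper derives subclasses from existing hierarchies or GPT-3 queries):

import torch
import clip
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device="cpu")

# (i) each coarse class gets a set of subclasses (hard-coded placeholders here)
hierarchy = {
    "dog": ["labrador retriever", "poodle", "beagle"],
    "cat": ["siamese cat", "tabby cat", "persian cat"],
}
subclasses = [(parent, sub) for parent, subs in hierarchy.items() for sub in subs]
prompts = clip.tokenize([f"a photo of a {sub}" for _, sub in subclasses])

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # placeholder image path
with torch.no_grad():
    img = model.encode_image(image)
    txt = model.encode_text(prompts)
img = img / img.norm(dim=-1, keepdim=True)
txt = txt / txt.norm(dim=-1, keepdim=True)

# (ii) standard zero-shot CLIP scoring, but over the subclass labels
scores = (img @ txt.T).squeeze(0)

# (iii) map the best-scoring subclass back to its parent class
parent, sub = subclasses[scores.argmax().item()]
print(f"predicted: {parent} (via subclass '{sub}')")
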
Fri 2:12 a.m. - 2:15 a.m. | Towards understanding the modality gap in CLIP (Poster)

This work examines the phenomenon of the modality gap observed in CLIP-based multimodal learning methods. The modality gap in this context refers to the separation of image and text embeddings in the joint latent space. Some previous research has attributed the gap to the cone effect of neural network initialization and suggested that closing it may not be necessary. However, this study argues that the modality gap is associated with local minima in the CLIP loss function. Through a series of proof-of-concept experiments, we illustrate these local minima and the difficulty of avoiding them in practice. Overall, this work aims to provide better insight into the root cause of the modality gap.

Peiyang Shi · Michael Welle · Mårten Björkman · Danica Kragic

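As background, the modality gap is commonly quantified as the distance between the centroids of the L2-normalized image and text embeddings in the shared space. A minimal sketch with stand-in embeddings (real measurements would use actual CLIP outputs):

import numpy as np

def modality_gap(image_embs, text_embs):
    # L2-normalize so both modalities live on the unit hypersphere, as in CLIP.
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    # Gap = Euclidean distance between the two modality centroids.
    return np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0))

# Stand-in embeddings; in practice these come from a trained image/text encoder pair.
rng = np.random.default_rng(0)
print(modality_gap(rng.normal(size=(1000, 512)), rng.normal(size=(1000, 512)) + 0.5))
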
Fri 2:18 a.m. - 3:00 a.m. | Poster Session

Fri 3:00 a.m. - 4:30 a.m. | Lunch Break

Fri 4:30 a.m. - 5:10 a.m. | Learning Visual Features Enriched by Audio or Language (Invited Talk)

Multimodal perception feature learning has great potential to unlock problems in video understanding, augmented reality, and embodied AI. I will present some of our recent work in learning with audio-visual (AV) and visual-language (VL) modalities. First, we explore how audio's spatial signals can augment visual understanding of 3D environments. This includes ideas for self-supervised feature learning from echoes and AV floorplan reconstruction. Next, building on these spatial AV and scene acoustics ideas, we introduce new ways to enhance the audio stream, making it possible to transport a sound to a new physical environment observed in a photo, or to dereverberate speech so it is intelligible for machine and human ears alike. Throughout this line of work, we leverage our open-source SoundSpaces platform, which provides state-of-the-art rendering of highly realistic audio in real-world scanned environments, and thereby facilitates self-supervised AV learning. Finally, we propose a hierarchical video-language (VL) embedding that simultaneously learns to account for both the "what" (step-by-step activity) and the "why" (intention of the actor) in egocentric video.

Kristen Grauman

Fri 5:10 a.m. - 5:13 a.m. | Using Multimodal DNNs to Localize Vision-Language Integration in the Brain (Poster)

We leverage a large electrocorticography dataset consisting of neural recordings in response to movie viewing, together with a battery of unimodal and multimodal deep neural network models (SBERT, BEiT, SimCLR, CLIP, SLIP), to identify candidate sites of multimodal integration in the human brain. Our data-driven method involves three steps: first, we parse the neural data into distinct event structures defined either by word onset times or by visual scene cuts. We then use the activity generated by these event structures in our candidate models to predict the activity generated in the brain. Finally, using contrasts between models with or without multimodal learning signals, we isolate those neural arrays driven more by multimodal representations than by unimodal representations. Using this method, we identify a sizable set of candidate neural sites that our model predictions suggest are shaped by multimodality (from 3% to 29%, depending on increasingly conservative statistical inclusion criteria). We note a meaningful cluster of these multimodal neurons in and around the temporoparietal junction, long theorized to be a hub of multimodal integration.

Vighnesh Subramaniam · Colin Conwell · Christopher Wang · Gabriel Kreiman · Boris Katz · Ignacio Cases · Andrei Barbu

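Step two and the contrast in step three amount to encoding models: predict each electrode's response from model activations and compare a multimodal feature set against a unimodal one. A schematic sketch with ridge regression, using placeholder arrays in place of the aligned event-level data:

import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

def encoding_score(features, responses, alphas=(1.0, 10.0, 100.0)):
    """Cross-validated R^2 of a ridge encoding model: model features -> neural response."""
    model = RidgeCV(alphas=alphas)
    return cross_val_score(model, features, responses, cv=5, scoring="r2").mean()

# Placeholder data: (n_events x n_features) model activations, n_events neural responses.
rng = np.random.default_rng(0)
multimodal_feats = rng.normal(size=(400, 128))
unimodal_feats = rng.normal(size=(400, 128))
electrode = rng.normal(size=400)

# An electrode is a candidate multimodal site if the multimodal features
# predict its response substantially better than the unimodal features do.
delta = encoding_score(multimodal_feats, electrode) - encoding_score(unimodal_feats, electrode)
print(delta)
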
Fri 5:13 a.m. - 5:16 a.m. | The Role of Pre-training Data in Transfer Learning (Poster)

We explore which pre-training dataset should be used to achieve the best transfer learning performance. We investigate the impact of pre-training on few-shot and full fine-tuning performance using 7 pre-training datasets and 9 downstream datasets. Through extensive controlled experiments, we find that the choice of pre-training dataset is essential for few-shot transfer, but that its role decreases as more data is made available for fine-tuning. Additionally, we explore the role of data curation and examine the trade-offs between label noise and the size of the pre-training dataset. We find that using 2000× more pre-training data from LAION can match the performance of supervised ImageNet pre-training.

Rahim Entezari · Mitchell Wortsman · Olga Saukh · Moein Shariatnia · Hanie Sedghi · Ludwig Schmidt

Fri 5:16 a.m. - 5:19 a.m. | Multimodal Subtask Graph Generation from Instructional Videos (Poster)

Real-world tasks consist of multiple inter-dependent subtasks (e.g., a dirty pan needs to be washed before cooking). In this work, we aim to model the causal dependencies between such subtasks from instructional videos describing the task. This is a challenging problem because complete information about the world is often inaccessible from videos, which demands robust learning mechanisms for understanding the causal structure of events. We present Multimodal Subtask Graph Generation (MSG^2), an approach that constructs a subtask graph defining the dependencies between a task's subtasks from noisy web videos. Graphs generated by our multimodal approach are closer to human-annotated graphs than those of prior approaches. On the downstream task of next-subtask prediction, MSG^2 is 85% and 30% more accurate than recent video transformer models on the ProceL and CrossTask datasets, respectively.

Yunseok Jang · Sungryull Sohn · Tiange Luo · Lajanugen Logeswaran · Moontae Lee · Honglak Lee

Fri 5:19 a.m. - 5:22 a.m. | Exploiting Category Names for Few-Shot Classification with Vision-Language Models (Poster)

Vision-language foundation models pretrained on large-scale data provide a powerful tool for many visual understanding tasks. Notably, many vision-language models build two encoders (visual and textual) that map the two modalities into the same embedding space. As a result, the learned representations achieve good zero-shot performance on tasks such as image classification. However, when there are only a few examples per category, the potential of large vision-language models is often under-exploited, mainly due to the gap between the large number of parameters and the relatively small amount of training data. This paper shows that we can significantly improve the performance of few-shot classification by using the category names to initialize the classification head. With the proposed category-name initialization method, our model obtains state-of-the-art performance on a number of few-shot image classification benchmarks (e.g., 87.37% on ImageNet and 96.08% on Stanford Cars, both using five-shot learning).

Taihong Xiao · Zirui Wang · Liangliang Cao · Jiahui Yu · Shengyang Dai · Ming-Hsuan Yang

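The proposed initialization is easy to sketch: encode each category name with the text tower and use the normalized embeddings as the initial weights of the classification head, which is then fine-tuned on the few labeled examples. The text-encoder call and prompt template below are illustrative assumptions, not the paper's exact setup:

import torch
import torch.nn as nn

def init_head_from_names(text_encoder, class_names, embed_dim):
    """Build a linear classifier whose rows start as the class-name text embeddings."""
    head = nn.Linear(embed_dim, len(class_names), bias=False)
    with torch.no_grad():
        w = torch.stack([text_encoder(f"a photo of a {name}") for name in class_names])
        w = w / w.norm(dim=-1, keepdim=True)
        head.weight.copy_(w)   # start from the zero-shot classifier ...
    return head                # ... then fine-tune on the few-shot data as usual

# Toy stand-in for a text tower that returns a 512-d embedding per prompt.
fake_text_encoder = lambda prompt: torch.randn(512)
head = init_head_from_names(fake_text_encoder, ["husky", "beagle", "poodle"], embed_dim=512)
image_features = torch.randn(4, 512)
logits = head(image_features)   # shape (4, 3)
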
Fri 5:22 a.m. - 5:25 a.m. | Classifier-free guidance makes image captioning models more descriptive (Poster)

Image captioning is conventionally formulated as the task of generating captions that are similar to a set of human-generated reference captions, as measured by evaluation metrics such as CIDEr, ROUGE, and BLEU. Recent work has also explored reference-free captioning metrics based on the distance between generated captions and the corresponding images in the embedding space of a contrastively trained image-text model such as CLIP. Here, we show that it is possible to trade off between reference-free and reference-based captioning metrics by decoding from a single autoregressive captioning model using classifier-free guidance (Ho & Salimans, 2021). Compared to standard greedy decoding, decoding from the same model with a guidance scale of 2 substantially improves caption-to-image retrieval performance when captions and images are embedded using CLIP (recall@1 44.3% vs. 26.6%) and marginally improves CLIPScore (0.786 vs. 0.761), but greatly worsens standard reference-based captioning metrics (e.g., CIDEr 79.9 vs. 126.3). Manual inspection reveals that higher guidance scales produce more descriptive but less grammatical captions.

Simon Kornblith · Lala Li · Zirui Wang · Thao Nguyen

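At decoding time, classifier-free guidance only changes how next-token logits are combined: the captioner is run once with the image and once with the image dropped, and the two logit vectors are extrapolated. A minimal sketch with a hypothetical captioner interface (a guidance scale of 1 recovers standard decoding):

import torch

def cfg_next_token_logits(captioner, image, prefix_tokens, guidance_scale=2.0):
    """Classifier-free-guided logits: l_uncond + gamma * (l_cond - l_uncond)."""
    logits_cond = captioner(prefix_tokens, image=image)     # conditioned on the image
    logits_uncond = captioner(prefix_tokens, image=None)    # image dropped / null conditioning
    return logits_uncond + guidance_scale * (logits_cond - logits_uncond)

def greedy_decode(captioner, image, bos_id, eos_id, max_len=30, guidance_scale=2.0):
    tokens = torch.tensor([[bos_id]])
    for _ in range(max_len):
        logits = cfg_next_token_logits(captioner, image, tokens, guidance_scale)
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_id], dim=-1)
        if next_id.item() == eos_id:
            break
    return tokens

# Toy stand-in captioner (random logits over a 100-token vocabulary) so the sketch runs.
toy_captioner = lambda tokens, image=None: torch.randn(tokens.shape[0], tokens.shape[1], 100)
caption_ids = greedy_decode(toy_captioner, image=torch.randn(3, 224, 224), bos_id=1, eos_id=2)
print(caption_ids.shape)
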
Fri 5:25 a.m. - 5:28 a.m. | Impossibility of Collective Intelligence (Poster)

This work provides a minimum requirement, in terms of intuitive and reasonable axioms, under which empirical risk minimization (ERM) is the only rational learning algorithm when learning in heterogeneous environments. We provide an axiomatization of any learning rule in terms of choice correspondences over a hypothesis space and seemingly primitive properties. We then show that the only feasible algorithm compatible with these properties is the standard ERM that learns arbitrarily from a single environment. This impossibility result implies that Collective Intelligence (CI), the ability of algorithms to successfully learn across heterogeneous environments, cannot be achieved without sacrificing at least one of these basic properties. More importantly, this work reveals the incomparability of performance metrics across environments as one of the fundamental limits in critical areas of machine learning such as out-of-distribution generalization, federated learning, algorithmic fairness, and multimodal learning.

Krikamol Muandet

Fri 5:28 a.m. - 6:10 a.m. | Poster Session

Fri 6:10 a.m. - 6:20 a.m. | Instruction-Finetuned Foundation Models for Multimodal Web Navigation (Poster)

We propose an instruction-aligned multimodal agent for autonomous web navigation, i.e., sequential decision-making tasks operating through a computer interface. Our approach is based on supervised finetuning of vision and language foundation models on a large corpus of web data consisting of webpage screenshots and HTML. Specifically, we use vision transformers on sequences of webpage screenshots to extract patch-level image features. These features are concatenated with embeddings of the tokens in the HTML documents. Using an instruction-finetuned large language model, we jointly encode both the vision and HTML modalities and decode web actions such as click and type. We show that our method outperforms previous approaches by a significant margin, even when handling out-of-distribution HTML and compositional tasks. On the MiniWoB benchmark, we improve over previous approaches that use only HTML input by more than 17.7%, even surpassing the performance of RL-finetuned models. On the recent WebShop benchmark, our 3-billion-parameter model achieves superior performance to the existing state of the art, PaLM-540B. We also collect 347K gold demonstrations using our trained models, 29 times more than prior work, and make them available to promote future research in this area. We believe our work is a step towards building capable and generalist decision-making agents for computer interfaces.

Hiroki Furuta · Ofir Nachum · Kuang-Huei Lee · Yutaka Matsuo · Shixiang Gu · Izzeddin Gur

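The fusion step described above, concatenating patch-level screenshot features with HTML token embeddings before they reach the language model, can be sketched as follows; the module sizes and the linear projection are illustrative assumptions rather than the paper's exact architecture:

import torch
import torch.nn as nn

class ScreenshotHtmlFusion(nn.Module):
    """Project ViT patch features into the LM embedding space and prepend them to HTML tokens."""

    def __init__(self, vit_dim=768, lm_dim=1024, vocab_size=32000):
        super().__init__()
        self.patch_proj = nn.Linear(vit_dim, lm_dim)
        self.html_embed = nn.Embedding(vocab_size, lm_dim)

    def forward(self, patch_features, html_token_ids):
        # patch_features: (batch, n_patches, vit_dim) from a vision transformer
        # html_token_ids: (batch, n_tokens) tokenized HTML
        visual = self.patch_proj(patch_features)
        textual = self.html_embed(html_token_ids)
        # The concatenated sequence is what the instruction-finetuned LM attends over.
        return torch.cat([visual, textual], dim=1)

fusion = ScreenshotHtmlFusion()
seq = fusion(torch.randn(2, 196, 768), torch.randint(0, 32000, (2, 512)))
print(seq.shape)  # (2, 196 + 512, 1024)
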
Fri 6:20 a.m. - 6:25 a.m. | Q&A
Hiroki Furuta · Ofir Nachum · Kuang-Huei Lee · Yutaka Matsuo · Shixiang Gu · Izzeddin Gur

Fri 6:25 a.m. - 6:35 a.m. | SemDeDup: Data-efficient learning at web-scale through semantic deduplication (Poster)

Progress in machine learning has been driven in large part by massive increases in data. However, large web-scale datasets such as LAION are largely uncurated beyond searches for exact duplicates, potentially leaving much redundancy. Here, we introduce SemDeDup, a method that leverages embeddings from pre-trained models to identify and remove "semantic duplicates": data pairs that are semantically similar but not exactly identical. Removing semantic duplicates preserves performance and speeds up learning. Analyzing a subset of LAION, we show that SemDeDup can remove 50% of the data with minimal performance loss, effectively halving training time; moreover, out-of-distribution performance increases. Analyzing language models trained on C4, a partially curated dataset, we also show that SemDeDup improves over prior approaches. SemDeDup is an example of how simple ways of leveraging quality embeddings can make models learn faster with less data.

Amro Kamal · Kushal Tirumala · Daniel Simig · Surya Ganguli · Ari Morcos

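Conceptually, the method is straightforward to sketch: embed every example with a pre-trained model, cluster the embeddings, and within each cluster keep only one member of any pair whose cosine similarity exceeds a threshold. The cluster count and threshold below are arbitrary illustrations, not the paper's settings:

import numpy as np
from sklearn.cluster import KMeans

def semantic_dedup(embeddings, n_clusters=10, threshold=0.95):
    """Return indices of examples to keep after removing near-duplicate embeddings."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(emb)
    keep = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        sims = emb[idx] @ emb[idx].T
        kept_local = []
        for i, global_i in enumerate(idx):
            # Drop this example if it is too similar to one already kept in the cluster.
            if all(sims[i, j] < threshold for j in kept_local):
                kept_local.append(i)
                keep.append(global_i)
    return np.array(sorted(keep))

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 64))
data[100:110] = data[0] + 1e-3 * rng.normal(size=(10, 64))  # inject near-duplicates
print(len(semantic_dedup(data)))  # fewer than 200 once duplicates are removed
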
Fri 6:35 a.m. - 6:40 a.m. | Q&A
Amro Kamal · Kushal Tirumala · Daniel Simig · Surya Ganguli · Ari Morcos

Fri 6:43 a.m. - 6:45 a.m. | Coffee Break

Fri 6:45 a.m. - 7:15 a.m. | Injecting large models with new modalities for Video Understanding (Invited Talk)

Large models have had an 'explosion' moment recently, achieving state-of-the-art results across various benchmarks and tasks. Here we discuss how they can be adapted to novel vision and audio inputs for multimodal tasks, either by influencing model design or by serving as frozen components in multimodal architectures. We focus on multimodal video captioning tasks such as ASR and automatic AD for movies, and cover some recently accepted papers at CVPR 2023.

Arsha Nagrani

Fri 7:20 a.m. - 7:50 a.m. | Towards Structured Multimodal Representations (Invited Talk)

Multimodal modelling has seen great interest in recent years, with fantastic results and applicability over a wide range of tasks. A particular feature of this applicability has been the development of conditional generation, and the chaining of such conditional models to generate cross-modally. This, however, has meant that the question of representations, and of what being cross-modal entails, has been eschewed in favour of high generative quality, leaving models as black boxes from the perspective of human inspection and interpretability. In this talk, I will touch upon some recent and ongoing work in our lab towards learning unsupervised models that capture structured representations, which can be constrained across modalities to address questions of interpretability through multimodal grounding.

Siddharth N

Fri 7:50 a.m. - 8:00 a.m. | Coffee Break

Fri 8:00 a.m. - 8:45 a.m. | The Perks and Pitfalls of MRL (Panel)
Arsha Nagrani · Luca Moschella · Paul Pu Liang · Siddharth N · Valentino Maiorca

Fri 8:45 a.m. - 9:00 a.m. | Closing Remarks (Closing)

Text-to-Image Diffusion Models are Zero-Shot Classifiers (Poster)

Text-to-image diffusion models have demonstrated remarkable generative capabilities, suggesting that they learn informative representations of image-text data. However, their abilities are not fully understood, and they have not been thoroughly explored on downstream tasks. We investigate diffusion models by proposing a method for evaluating them as zero-shot classifiers. The key idea is to use a diffusion model's ability to denoise a noised image, given a textual description of a label, as a proxy for that label's likelihood. We apply our method to Imagen, using it to probe fine-grained aspects of Imagen's knowledge and to compare it with CLIP's zero-shot abilities. Imagen performs competitively with CLIP on a wide range of zero-shot image classification datasets. Additionally, it is more robust than CLIP and can successfully perform attribute binding where CLIP cannot. Although generative pre-training is common in NLP, visual foundation models often use other methods such as contrastive learning. Based on our findings, we argue that generative pre-training should be explored as a compelling alternative for visual and vision-language problems.

Kevin Clark · Priyank Jaini
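
The evaluation recipe in the abstract can be sketched generically: for each candidate label, noise the image, ask the text-conditioned model to denoise, and score the label by the average denoising error (lower is better). The denoiser interface and noise schedule below are hypothetical stand-ins, not Imagen's API:

import torch

@torch.no_grad()
def diffusion_zero_shot_classify(denoiser, image, class_prompts, n_trials=8):
    """Pick the label whose text conditioning yields the lowest denoising error."""
    errors = []
    for prompt in class_prompts:
        trial_errors = []
        for _ in range(n_trials):
            t = torch.rand(1)                        # random noise level in (0, 1)
            noise = torch.randn_like(image)
            noisy = torch.sqrt(1 - t) * image + torch.sqrt(t) * noise
            pred_noise = denoiser(noisy, t, prompt)  # hypothetical denoiser interface
            trial_errors.append(((pred_noise - noise) ** 2).mean())
        errors.append(torch.stack(trial_errors).mean())
    return class_prompts[int(torch.stack(errors).argmin())]

# Toy stand-in denoiser so the sketch runs end to end.
fake_denoiser = lambda noisy, t, prompt: torch.zeros_like(noisy)
image = torch.randn(3, 64, 64)
print(diffusion_zero_shot_classify(fake_denoiser, image, ["a photo of a dog", "a photo of a cat"]))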