Workshop
Multimodal Representation Learning (MRL): Perks and Pitfalls
Adrián Javaloy · Miguel Vasco · Imant Daunhawer · Petra Poklukar · Yuge Shi · Danica Kragic · Isabel Valera
Virtual
Fri 5 May, midnight PDT
Following the rise of deep learning, multimodal machine learning has made steady progress and become ubiquitous in many domains. Learning representations from multiple modalities can be beneficial, since different perceptual modalities can inform each other and ground abstract phenomena in a more robust, generalisable way. However, the complexity of different modalities can hinder the training process, requiring careful model design in order to learn meaningful representations. In light of these seemingly conflicting aspects of multimodal learning, we must improve our understanding of what makes each modality different, how modalities interact, and what the desiderata of multimodal representations are. With this workshop, we aim to bring the multimodal community together, promoting work on multimodal representation learning that provides systematic insights into the nature of the learned representations, as well as ways to improve and understand the training of multimodal models, both from a theoretical and an empirical point of view. In particular, we focus on the following questions:

(Representation) How do we identify useful properties of multimodal representations?
(Training) How can we promote useful properties of multimodal representations?
(Modalities) What makes a modality different? How can we improve their interactions?

The MRL workshop aims to bring together experts from the multimodal learning community to advance these fundamental questions and discuss the future of the field. We invite submissions that present analyses of the properties of multimodal representations, insights on interactions across modalities, and novel applications regarding the nature and number of modalities employed.
Schedule
Fri 12:00 a.m. - 12:10 a.m. | Introduction and Opening Remarks (Intro)
Fri 12:10 a.m. - 12:40 a.m. | Foundations of Multimodal Machine Learning: Principles, Challenges, and Open Questions (Invited Talk) | Paul Pu Liang
Fri 12:45 a.m. - 12:55 a.m. | Analyzing Multimodal Objectives Through the Lens of Generative Diffusion Guidance (Poster) | Chaerin Kong · Nojun Kwak
Fri 12:55 a.m. - 1:00 a.m. | Q&A (Q&A) | Chaerin Kong · Nojun Kwak
Fri 1:00 a.m. - 1:10 a.m. | Hyperbolic Image-Text Representations (Poster) | Karan Desai · Maximilian Nickel · Tanmay Rajpurohit · Justin Johnson · Shanmukha Ramakrishna Vedantam
Fri 1:11 a.m. - 1:20 a.m. | Coffee Break
Fri 1:20 a.m. - 2:00 a.m. | Compositionality and Abstraction in Multimodal Learning (Invited Talk) | Zeynep Akata
Fri 2:00 a.m. - 2:03 a.m. | Interpreting Multimodal Video Transformers Using Brain Recordings (Poster) | Tianai Dong · Mariya Toneva
Fri 2:03 a.m. - 2:06 a.m. | A Picture is Worth a Thousand Words: Language Models Plan from Pixels (Poster) | Anthony Z Liu · Lajanugen Logeswaran · Sungryull Sohn · Honglak Lee
Fri 2:06 a.m. - 2:09 a.m. | Dynamic Pretraining of Vision-Language Models (Poster) | AJ Piergiovanni · Weicheng Kuo · Wei Li · Anelia Angelova
Fri 2:09 a.m. - 2:12 a.m. | CHiLS: Zero-shot Image Classification with Hierarchical Label Sets (Poster) | Zachary Novack · Saurabh Garg · Julian McAuley · Zachary Lipton
Fri 2:12 a.m. - 2:15 a.m. | Towards understanding the modality gap in CLIP (Poster) | Peiyang Shi · Michael Welle · Mårten Björkman · Danica Kragic
Fri 2:18 a.m. - 3:00 a.m. | Poster Session (Poster Session)
Fri 3:00 a.m. - 4:30 a.m. | Lunch Break
Fri 4:30 a.m. - 5:10 a.m. | Learning Visual Features Enriched by Audio or Language (Invited Talk) | Kristen Grauman
Fri 5:10 a.m. - 5:13 a.m. | Using Multimodal DNNs to Localize Vision-Language Integration in the Brain (Poster) | Vighnesh Subramaniam · Colin Conwell · Christopher Wang · Gabriel Kreiman · Boris Katz · Ignacio Cases · Andrei Barbu
Fri 5:13 a.m. - 5:16 a.m. | The Role of Pre-training Data in Transfer Learning (Poster) | Rahim Entezari · Mitchell Wortsman · Olga Saukh · Moein Shariatnia · Hanie Sedghi · Ludwig Schmidt
Fri 5:16 a.m. - 5:19 a.m. | Multimodal Subtask Graph Generation from Instructional Videos (Poster) | Yunseok Jang · Sungryull Sohn · Tiange Luo · Lajanugen Logeswaran · Moontae Lee · Honglak Lee
Fri 5:19 a.m. - 5:22 a.m. | Exploiting Category Names for Few-Shot Classification with Vision-Language Models (Poster) | Taihong Xiao · Zirui Wang · Liangliang Cao · Jiahui Yu · Shengyang Dai · Ming-Hsuan Yang
Fri 5:22 a.m. - 5:25 a.m. | Classifier-free guidance makes image captioning models more descriptive (Poster) | Simon Kornblith · Lala Li · Zirui Wang · Thao Nguyen
Fri 5:25 a.m. - 5:28 a.m. | Impossibility of Collective Intelligence (Poster) | Krikamol Muandet
Fri 5:28 a.m. - 6:10 a.m. | Poster Session (Poster Session)
Fri 6:10 a.m. - 6:20 a.m. | Instruction-Finetuned Foundation Models for Multimodal Web Navigation (Poster) | Hiroki Furuta · Ofir Nachum · Kuang-Huei Lee · Yutaka Matsuo · Shixiang Gu · Izzeddin Gur
Fri 6:20 a.m. - 6:25 a.m. | Q&A (Q&A) | Hiroki Furuta · Ofir Nachum · Kuang-Huei Lee · Yutaka Matsuo · Shixiang Gu · Izzeddin Gur
Fri 6:25 a.m. - 6:35 a.m. | SemDeDup: Data-efficient learning at web-scale through semantic deduplication (Poster) | Amro Kamal · Kushal Tirumala · Daniel Simig · Surya Ganguli · Ari Morcos
Fri 6:35 a.m. - 6:40 a.m. | Q&A (Q&A) | Amro Kamal · Kushal Tirumala · Daniel Simig · Surya Ganguli · Ari Morcos
Fri 6:43 a.m. - 6:45 a.m. | Coffee Break
Fri 6:45 a.m. - 7:15 a.m. | Injecting large models with new modalities for Video Understanding (Invited Talk) | Arsha Nagrani
Fri 7:20 a.m. - 7:50 a.m. | Towards Structured Multimodal Representations (Invited Talk) | Siddharth N
Fri 7:50 a.m. - 8:00 a.m. | Coffee Break
Fri 8:00 a.m. - 8:45 a.m. | The Perks and Pitfalls of MRL (Panel) | Arsha Nagrani · Luca Moschella · Paul Pu Liang · Siddharth N · Valentino Maiorca
Fri 8:45 a.m. - 9:00 a.m. | Closing Remarks (Closing)
- | Text-to-Image Diffusion Models are Zero-Shot Classifiers (Poster) | Kevin Clark · Priyank Jaini