Following the success of deep learning, multimodal machine learning has made steady progress and is now ubiquitous in many domains. Learning representations from multiple modalities can be beneficial: different perceptual modalities can inform each other and ground abstract phenomena in a more robust, generalisable way. However, the complexity of different modalities can hinder the training process, requiring careful model design in order to learn meaningful representations. In light of these seemingly conflicting aspects of multimodal learning, we must improve our understanding of what makes each modality different, how modalities interact, and what the desiderata of multimodal representations are.

With this workshop, we aim to bring the multimodal community together and promote work on multimodal representation learning that provides systematic insights into the nature of the learned representations, as well as ways to improve and understand the training of multimodal models, from both a theoretical and an empirical point of view. In particular, we focus on the following questions:

(Representation) How do we identify useful properties of multimodal representations?
(Training) How can we promote useful properties of multimodal representations?
(Modalities) What makes a modality different? How can we improve their interactions?

The MRL workshop aims to bring together experts from the multimodal learning community to advance these fundamental questions and discuss the future of the field. We invite submissions that present analyses of the properties of multimodal representations, insights on interactions across modalities, and novel applications regarding the nature and number of modalities employed.

Fri 12:00 a.m. - 12:10 a.m. | Introduction and Opening Remarks (Intro)

Fri 12:10 a.m. - 12:40 a.m. | Foundations of Multimodal Machine Learning: Principles, Challenges, and Open Questions (Invited Talk)

Multimodal machine learning has brought unique computational and theoretical challenges to the machine learning community, given the heterogeneity of data sources and the interconnections often found between modalities. However, the breadth of progress in multimodal research has made it difficult to identify the common themes and open questions in the field. By synthesizing a broad range of application domains and theoretical frameworks from both historical and recent perspectives, this talk provides an overview of the computational and theoretical foundations of multimodal machine learning. We start by defining three key principles of modality heterogeneity, connections, and interactions that have driven subsequent innovations, and propose a taxonomy of six core technical challenges (representation, alignment, reasoning, generation, transference, and quantification) covering historical and recent trends. Recent technical achievements will be presented through the lens of this taxonomy, allowing researchers to understand the similarities and differences across new approaches. We end by motivating several open problems for future research identified by our taxonomy.

Paul Pu Liang

Fri 12:45 a.m. - 12:55 a.m. | Analyzing Multimodal Objectives Through the Lens of Generative Diffusion Guidance (Poster)

Recent years have witnessed astonishing advances in the field of multimodal representation learning, with contrastive learning being the cornerstone of major breakthroughs. The latest works have delivered further improvements by incorporating objectives such as masked modeling and captioning into these frameworks, but our understanding of how these objectives facilitate learning remains vastly incomplete. In this paper, we leverage the fact that classifier-guided diffusion models generate images reflecting the semantic signals provided by the classifier to study the characteristics of multimodal learning objectives. Specifically, we compare contrastive, matching, and captioning losses in terms of their semantic signals, and introduce a simple baseline that not only supports our analyses but also improves the quality of generative guidance in a straightforward manner.

Chaerin Kong · Nojun Kwak

Fri 12:55 a.m. - 1:00 a.m. | Q&A
Chaerin Kong · Nojun Kwak

Fri 1:00 a.m. - 1:10 a.m. | Hyperbolic Image-Text Representations (Poster)

Visual and linguistic concepts naturally organize themselves in a hierarchy, where a textual concept such as 'dog' entails all images that contain dogs. Despite this intuition, current large-scale vision and language models such as CLIP do not explicitly capture such a hierarchy. We propose MERU, a contrastive model that yields hyperbolic representations of images and text. Hyperbolic manifolds have suitable geometric properties for embedding tree-like data, so MERU can better capture the underlying hierarchy in image-text datasets. Our results show that MERU learns a highly interpretable and structured representation space while maintaining (or improving) CLIP's performance on standard transfer tasks such as zero-shot classification, retrieval, and resource-constrained deployment.

Karan Desai · Maximilian Nickel · Tanmay Rajpurohit · Justin Johnson · Shanmukha Ramakrishna Vedantam

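For readers unfamiliar with the Lorentz (hyperboloid) model referenced above, the following is a minimal numerical sketch, not MERU's implementation: encoder outputs are treated as tangent vectors at the origin, lifted onto the hyperboloid with the exponential map, and compared by negative geodesic distance. The curvature value, dimensionality, and random embeddings are illustrative assumptions.

import numpy as np

def expmap_origin(v, c=1.0):
    # Lift a Euclidean "space" vector v onto the hyperboloid of curvature -c.
    # Returns (x0, x_space) satisfying -x0^2 + ||x_space||^2 = -1/c.
    vnorm = np.linalg.norm(v, axis=-1, keepdims=True) + 1e-12
    x0 = np.cosh(np.sqrt(c) * vnorm) / np.sqrt(c)
    xs = np.sinh(np.sqrt(c) * vnorm) * v / (np.sqrt(c) * vnorm)
    return np.concatenate([x0, xs], axis=-1)

def lorentz_distance(x, y, c=1.0):
    # Geodesic distance from the Lorentzian inner product <x, y>_L = -x0*y0 + <xs, ys>.
    inner = -x[..., 0] * y[..., 0] + np.sum(x[..., 1:] * y[..., 1:], axis=-1)
    return np.arccosh(np.clip(-c * inner, 1.0, None)) / np.sqrt(c)

# Toy usage: stand-in 4-d image/text embeddings produced by the two encoders.
img_tangent = 0.1 * np.random.randn(3, 4)
txt_tangent = 0.1 * np.random.randn(3, 4)
img_h = expmap_origin(img_tangent)
txt_h = expmap_origin(txt_tangent)
logits = -lorentz_distance(img_h[:, None, :], txt_h[None, :, :])
print(logits.shape)  # (3, 3)

In a CLIP-style contrastive objective, these negative geodesic distances would play the role that cosine-similarity logits play in Euclidean models.
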
Fri 1:11 a.m. - 1:20 a.m. | Coffee break (Q&A)

Fri 1:20 a.m. - 2:00 a.m. | Compositionality and Abstraction in Multimodal Learning (Invited Talk)
Zeynep Akata

Fri 2:00 a.m. - 2:03 a.m. | Interpreting Multimodal Video Transformers Using Brain Recordings (Poster)

Integrating information from multiple modalities is arguably one of the essential prerequisites for grounding artificial intelligence systems in an understanding of the real world. Recent advances in video transformers that jointly learn from vision, text, and sound over time have made some progress toward this goal, but the degree to which these models integrate information from the input modalities remains unclear. In this work, we present a promising approach for probing a multimodal video transformer by leveraging neuroscientific evidence of multimodal information processing in the brain. We use brain recordings of subjects watching a popular TV show to interpret the integration of multiple modalities in a video transformer, before and after it is trained to perform a question-answering task that requires vision and language information. For the early and middle layers, we show that fine-tuning on the vision-language task does not improve alignment, relative to the pre-trained counterparts, in brain regions thought to support the integration of multimodal information. We further show that the top layers of the fine-tuned model align substantially less with the brain representations yet yield better task performance than other layers, which indicates that the task may require information beyond what is available in the brain recordings.

Tianai Dong · Mariya Toneva

Fri 2:03 a.m. - 2:06 a.m. | A Picture is Worth a Thousand Words: Language Models Plan from Pixels (Poster)

Planning is an important capability of artificial agents that perform long-horizon tasks in real-world environments. In this work, we explore the use of pre-trained language models (PLMs) to reason about plan sequences from text instructions in embodied visual environments. Prior PLM-based approaches to planning either assume that observations are available in the form of text produced by a captioning model, reason about plans from the instruction alone, or incorporate information about the visual environment in limited ways (such as through a pre-trained affordance function). In contrast, we show that a PLM can plan accurately even when observations are directly encoded as input prompts. We show that this simple approach outperforms prior methods in experiments on the ALFWorld and VirtualHome benchmarks.

Anthony Z Liu · Lajanugen Logeswaran · Sungryull Sohn · Honglak Lee

Fri 2:06 a.m. - 2:09 a.m. | Dynamic Pretraining of Vision-Language Models (Poster)

Vision-language pretraining aims to learn universal cross-modal representations and to create models with broad capabilities. While most work has moved toward scaling training to increasingly large models and datasets, in this paper we propose a dynamic pretraining resampling approach that draws on a variety of pretraining tasks and results in more sample-efficient models. We show that a diverse set of self- and weakly-supervised pretraining tasks, dynamically sampled according to task difficulty, provides strong performance. A single 330M-parameter pretrained model, using only smaller and publicly accessible image-language datasets, achieves competitive or state-of-the-art performance on three diverse groups of tasks: visual question answering, text-based image localization by referring expressions, and video question answering.

AJ Piergiovanni · Weicheng Kuo · Wei Li · Anelia Angelova

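The abstract above describes dynamically resampling pretraining tasks according to task difficulty. As a generic illustration of that idea (not the authors' exact scheme), one can track an exponential moving average of each task's loss and sample the next task in proportion to a softmax over those averages:

import math
import random

class DifficultySampler:
    """Sample pretraining tasks in proportion to their recent (EMA) loss."""

    def __init__(self, task_names, temperature=1.0, ema_decay=0.9):
        self.ema = {t: 1.0 for t in task_names}   # optimistic initial difficulty
        self.temperature = temperature
        self.ema_decay = ema_decay

    def update(self, task, loss_value):
        # Keep a smoothed estimate of how hard each task currently is.
        self.ema[task] = self.ema_decay * self.ema[task] + (1 - self.ema_decay) * loss_value

    def sample(self):
        tasks = list(self.ema)
        weights = [math.exp(self.ema[t] / self.temperature) for t in tasks]
        return random.choices(tasks, weights=weights, k=1)[0]

sampler = DifficultySampler(["captioning", "image_text_matching", "masked_lm"])
task = sampler.sample()                 # choose the task for the next training step
sampler.update(task, loss_value=2.3)    # feed back the observed training loss
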
Fri 2:09 a.m. - 2:12 a.m. | CHiLS: Zero-shot Image Classification with Hierarchical Label Sets (Poster)

Open-vocabulary models (e.g., CLIP) have shown strong performance on zero-shot classification through their ability to generate embeddings for each class based on its (natural language) name. Prior work has focused on improving the accuracy of these models through prompt engineering or by fine-tuning with a small amount of labeled downstream data. However, there has been little focus on improving the richness of the class names themselves, which can pose issues when class labels are coarsely defined and uninformative. We propose Classification with Hierarchical Label Sets (or CHiLS), an alternative strategy for zero-shot classification specifically designed for datasets with implicit semantic hierarchies. CHiLS proceeds in three steps: (i) for each class, produce a set of subclasses, using either existing hierarchies or by querying GPT-3; (ii) perform the standard zero-shot CLIP procedure as though these subclasses were the labels of interest; (iii) map the predicted subclass back to its parent to produce the final prediction. Across numerous datasets with underlying hierarchical structure, CHiLS improves accuracy both with and without ground-truth hierarchical information.

Zachary Novack · Saurabh Garg · Julian McAuley · Zachary Lipton

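The three steps listed in the abstract map almost directly onto code. Below is a minimal sketch using the OpenAI clip package; the subclass lists and image path are placeholders (the paper derives subclasses from existing hierarchies or GPT-3 queries):

import torch
import clip
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device="cpu")

# (i) each coarse class gets a set of subclasses (hard-coded placeholders here)
hierarchy = {
    "dog": ["labrador retriever", "poodle", "beagle"],
    "cat": ["siamese cat", "tabby cat", "persian cat"],
}
subclasses = [(parent, sub) for parent, subs in hierarchy.items() for sub in subs]
prompts = clip.tokenize([f"a photo of a {sub}" for _, sub in subclasses])

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # placeholder image path
with torch.no_grad():
    img = model.encode_image(image)
    txt = model.encode_text(prompts)
img = img / img.norm(dim=-1, keepdim=True)
txt = txt / txt.norm(dim=-1, keepdim=True)

# (ii) standard zero-shot CLIP scoring, but over the subclass labels
scores = (img @ txt.T).squeeze(0)

# (iii) map the best-scoring subclass back to its parent class
parent, sub = subclasses[scores.argmax().item()]
print(f"predicted: {parent} (via subclass '{sub}')")
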
Fri 2:12 a.m. - 2:15 a.m. | Towards understanding the modality gap in CLIP (Poster)

This work examines the phenomenon of the modality gap observed in CLIP-based multimodal learning methods. The modality gap in this context refers to the separation of image and text embeddings in the joint latent space. Some previous research has attributed the gap to the cone effect of neural network initialization and suggested that closing it may not be necessary. However, this study argues that the modality gap is associated with local minima in the CLIP loss function. Through a series of proof-of-concept experiments, we illustrate these local minima and the difficulty of avoiding them in practice. Overall, this work aims to provide better insight into the root cause of the modality gap.

Peiyang Shi · Michael Welle · Mårten Björkman · Danica Kragic

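As background, the modality gap is commonly quantified as the distance between the centroids of the L2-normalized image and text embeddings in the shared space. A minimal sketch with stand-in embeddings (real measurements would use actual CLIP outputs):

import numpy as np

def modality_gap(image_embs, text_embs):
    # L2-normalize so both modalities live on the unit hypersphere, as in CLIP.
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    # Gap = Euclidean distance between the two modality centroids.
    return np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0))

# Stand-in embeddings; in practice these come from a trained image/text encoder pair.
rng = np.random.default_rng(0)
print(modality_gap(rng.normal(size=(1000, 512)), rng.normal(size=(1000, 512)) + 0.5))
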
Fri 2:18 a.m. - 3:00 a.m. | Poster Session

Fri 3:00 a.m. - 4:30 a.m. | Lunch Break

Fri 4:30 a.m. - 5:10 a.m. | Learning Visual Features Enriched by Audio or Language (Invited Talk)

Multimodal perception feature learning has great potential to unlock problems in video understanding, augmented reality, and embodied AI. I will present some of our recent work in learning with audio-visual (AV) and visual-language (VL) modalities. First, we explore how audio's spatial signals can augment visual understanding of 3D environments. This includes ideas for self-supervised feature learning from echoes and AV floorplan reconstruction. Next, building on these spatial AV and scene acoustics ideas, we introduce new ways to enhance the audio stream, making it possible to transport a sound to a new physical environment observed in a photo, or to dereverberate speech so it is intelligible for machine and human ears alike. Throughout this line of work, we leverage our open-source SoundSpaces platform, which provides state-of-the-art rendering of highly realistic audio in real-world scanned environments, and thereby facilitates self-supervised AV learning. Finally, we propose a hierarchical video-language (VL) embedding that simultaneously learns to account for both the "what" (step-by-step activity) and the "why" (intention of the actor) in egocentric video.

Kristen Grauman

Fri 5:10 a.m. - 5:13 a.m. | Using Multimodal DNNs to Localize Vision-Language Integration in the Brain (Poster)

We leverage a large electrocorticography dataset consisting of neural recordings in response to movie viewing, together with a battery of unimodal and multimodal deep neural network models (SBERT, BEiT, SimCLR, CLIP, SLIP), to identify candidate sites of multimodal integration in the human brain. Our data-driven method involves three steps: first, we parse the neural data into distinct event structures defined either by word onset times or by visual scene cuts. We then use the activity generated by these event structures in our candidate models to predict the activity generated in the brain. Finally, using contrasts between models with or without multimodal learning signals, we isolate those neural arrays driven more by multimodal representations than by unimodal representations. Using this method, we identify a sizable set of candidate neural sites that our model predictions suggest are shaped by multimodality (from 3% to 29%, depending on increasingly conservative statistical inclusion criteria). We note a meaningful cluster of these multimodal neurons in and around the temporoparietal junction, long theorized to be a hub of multimodal integration.

Vighnesh Subramaniam · Colin Conwell · Christopher Wang · Gabriel Kreiman · Boris Katz · Ignacio Cases · Andrei Barbu

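Step two and the contrast in step three amount to encoding models: predict each electrode's response from model activations and compare a multimodal feature set against a unimodal one. A schematic sketch with ridge regression, using placeholder arrays in place of the aligned event-level data:

import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

def encoding_score(features, responses, alphas=(1.0, 10.0, 100.0)):
    """Cross-validated R^2 of a ridge encoding model: model features -> neural response."""
    model = RidgeCV(alphas=alphas)
    return cross_val_score(model, features, responses, cv=5, scoring="r2").mean()

# Placeholder data: (n_events x n_features) model activations, n_events neural responses.
rng = np.random.default_rng(0)
multimodal_feats = rng.normal(size=(400, 128))
unimodal_feats = rng.normal(size=(400, 128))
electrode = rng.normal(size=400)

# An electrode is a candidate multimodal site if the multimodal features
# predict its response substantially better than the unimodal features do.
delta = encoding_score(multimodal_feats, electrode) - encoding_score(unimodal_feats, electrode)
print(delta)
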
Fri 5:13 a.m. - 5:16 a.m. | The Role of Pre-training Data in Transfer Learning (Poster)

We explore which pre-training dataset should be used to achieve the best transfer learning performance. We investigate the impact of pre-training on few-shot and full fine-tuning performance using 7 pre-training datasets and 9 downstream datasets. Through extensive controlled experiments, we find that the choice of pre-training dataset is essential for few-shot transfer, but that its role decreases as more data is made available for fine-tuning. Additionally, we explore the role of data curation and examine the trade-offs between label noise and the size of the pre-training dataset. We find that using 2000× more pre-training data from LAION can match the performance of supervised ImageNet pre-training.

Rahim Entezari · Mitchell Wortsman · Olga Saukh · Moein Shariatnia · Hanie Sedghi · Ludwig Schmidt

Fri 5:16 a.m. - 5:19 a.m. | Multimodal Subtask Graph Generation from Instructional Videos (Poster)

Real-world tasks consist of multiple inter-dependent subtasks (e.g., a dirty pan needs to be washed before cooking). In this work, we aim to model the causal dependencies between such subtasks from instructional videos describing the task. This is a challenging problem because complete information about the world is often inaccessible from videos, which demands robust learning mechanisms for understanding the causal structure of events. We present Multimodal Subtask Graph Generation (MSG^2), an approach that constructs a subtask graph defining the dependencies between a task's subtasks from noisy web videos. Graphs generated by our multimodal approach are closer to human-annotated graphs than those of prior approaches. On the downstream task of next-subtask prediction, MSG^2 is 85% and 30% more accurate than recent video transformer models on the ProceL and CrossTask datasets, respectively.

Yunseok Jang · Sungryull Sohn · Tiange Luo · Lajanugen Logeswaran · Moontae Lee · Honglak Lee

Fri 5:19 a.m. - 5:22 a.m. | Exploiting Category Names for Few-Shot Classification with Vision-Language Models (Poster)

Vision-language foundation models pretrained on large-scale data provide a powerful tool for many visual understanding tasks. Notably, many vision-language models build two encoders (visual and textual) that map the two modalities into the same embedding space. As a result, the learned representations achieve good zero-shot performance on tasks such as image classification. However, when there are only a few examples per category, the potential of large vision-language models is often under-exploited, mainly due to the gap between the large number of parameters and the relatively small amount of training data. This paper shows that we can significantly improve the performance of few-shot classification by using the category names to initialize the classification head. With the proposed category-name initialization method, our model obtains state-of-the-art performance on a number of few-shot image classification benchmarks (e.g., 87.37% on ImageNet and 96.08% on Stanford Cars, both using five-shot learning).

Taihong Xiao · Zirui Wang · Liangliang Cao · Jiahui Yu · Shengyang Dai · Ming-Hsuan Yang

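The proposed initialization is easy to sketch: encode each category name with the text tower and use the normalized embeddings as the initial weights of the classification head, which is then fine-tuned on the few labeled examples. The text-encoder call and prompt template below are illustrative assumptions, not the paper's exact setup:

import torch
import torch.nn as nn

def init_head_from_names(text_encoder, class_names, embed_dim):
    """Build a linear classifier whose rows start as the class-name text embeddings."""
    head = nn.Linear(embed_dim, len(class_names), bias=False)
    with torch.no_grad():
        w = torch.stack([text_encoder(f"a photo of a {name}") for name in class_names])
        w = w / w.norm(dim=-1, keepdim=True)
        head.weight.copy_(w)   # start from the zero-shot classifier ...
    return head                # ... then fine-tune on the few-shot data as usual

# Toy stand-in for a text tower that returns a 512-d embedding per prompt.
fake_text_encoder = lambda prompt: torch.randn(512)
head = init_head_from_names(fake_text_encoder, ["husky", "beagle", "poodle"], embed_dim=512)
image_features = torch.randn(4, 512)
logits = head(image_features)   # shape (4, 3)
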
Fri 5:22 a.m. - 5:25 a.m. | Classifier-free guidance makes image captioning models more descriptive (Poster)

Image captioning is conventionally formulated as the task of generating captions that are similar to a set of human-generated reference captions, as measured by evaluation metrics such as CIDEr, ROUGE, and BLEU. Recent work has also explored reference-free captioning metrics based on the distance between generated captions and the corresponding images in the embedding space of a contrastively trained image-text model such as CLIP. Here, we show that it is possible to trade off between reference-free and reference-based captioning metrics by decoding from a single autoregressive captioning model using classifier-free guidance (Ho & Salimans, 2021). Compared to standard greedy decoding, decoding from the same model with a guidance scale of 2 substantially improves caption-to-image retrieval performance when captions and images are embedded using CLIP (recall@1 44.3% vs. 26.6%) and marginally improves CLIPScore (0.786 vs. 0.761), but greatly worsens standard reference-based captioning metrics (e.g., CIDEr 79.9 vs. 126.3). Manual inspection reveals that higher guidance scales produce more descriptive but less grammatical captions.

Simon Kornblith · Lala Li · Zirui Wang · Thao Nguyen

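At decoding time, classifier-free guidance only changes how next-token logits are combined: the captioner is run once with the image and once with the image dropped, and the two logit vectors are extrapolated. A minimal sketch with a hypothetical captioner interface (a guidance scale of 1 recovers standard decoding):

import torch

def cfg_next_token_logits(captioner, image, prefix_tokens, guidance_scale=2.0):
    """Classifier-free-guided logits: l_uncond + gamma * (l_cond - l_uncond)."""
    logits_cond = captioner(prefix_tokens, image=image)     # conditioned on the image
    logits_uncond = captioner(prefix_tokens, image=None)    # image dropped / null conditioning
    return logits_uncond + guidance_scale * (logits_cond - logits_uncond)

def greedy_decode(captioner, image, bos_id, eos_id, max_len=30, guidance_scale=2.0):
    tokens = torch.tensor([[bos_id]])
    for _ in range(max_len):
        logits = cfg_next_token_logits(captioner, image, tokens, guidance_scale)
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_id], dim=-1)
        if next_id.item() == eos_id:
            break
    return tokens

# Toy stand-in captioner (random logits over a 100-token vocabulary) so the sketch runs.
toy_captioner = lambda tokens, image=None: torch.randn(tokens.shape[0], tokens.shape[1], 100)
caption_ids = greedy_decode(toy_captioner, image=torch.randn(3, 224, 224), bos_id=1, eos_id=2)
print(caption_ids.shape)
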
Fri 5:25 a.m. - 5:28 a.m. | Impossibility of Collective Intelligence (Poster)

This work provides a minimum requirement, in terms of intuitive and reasonable axioms, under which empirical risk minimization (ERM) is the only rational learning algorithm when learning in heterogeneous environments. We provide an axiomatization of any learning rule in terms of choice correspondences over a hypothesis space and seemingly primitive properties. We then show that the only feasible algorithm compatible with these properties is the standard ERM that learns arbitrarily from a single environment. This impossibility result implies that Collective Intelligence (CI), the ability of algorithms to successfully learn across heterogeneous environments, cannot be achieved without sacrificing at least one of these basic properties. More importantly, this work reveals the incomparability of performance metrics across environments as one of the fundamental limits in critical areas of machine learning such as out-of-distribution generalization, federated learning, algorithmic fairness, and multimodal learning.

Krikamol Muandet

Fri 5:28 a.m. - 6:10 a.m. | Poster Session

Fri 6:10 a.m. - 6:20 a.m. | Instruction-Finetuned Foundation Models for Multimodal Web Navigation (Poster)

We propose an instruction-aligned multimodal agent for autonomous web navigation, i.e., sequential decision-making tasks operating through a computer interface. Our approach is based on supervised finetuning of vision and language foundation models on a large corpus of web data consisting of webpage screenshots and HTML. Specifically, we use vision transformers on sequences of webpage screenshots to extract patch-level image features. These features are concatenated with embeddings of the tokens in the HTML documents. Using an instruction-finetuned large language model, we jointly encode both the vision and HTML modalities and decode web actions such as click and type. We show that our method outperforms previous approaches by a significant margin, even when handling out-of-distribution HTML and compositional tasks. On the MiniWoB benchmark, we improve over previous approaches that use only HTML input by more than 17.7%, even surpassing the performance of RL-finetuned models. On the recent WebShop benchmark, our 3-billion-parameter model achieves superior performance to the existing state of the art, PaLM-540B. We also collect 347K gold demonstrations using our trained models, 29 times more than prior work, and make them available to promote future research in this area. We believe our work is a step towards building capable and generalist decision-making agents for computer interfaces.

Hiroki Furuta · Ofir Nachum · Kuang-Huei Lee · Yutaka Matsuo · Shixiang Gu · Izzeddin Gur

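The fusion step described above, concatenating patch-level screenshot features with HTML token embeddings before they reach the language model, can be sketched as follows; the module sizes and the linear projection are illustrative assumptions rather than the paper's exact architecture:

import torch
import torch.nn as nn

class ScreenshotHtmlFusion(nn.Module):
    """Project ViT patch features into the LM embedding space and prepend them to HTML tokens."""

    def __init__(self, vit_dim=768, lm_dim=1024, vocab_size=32000):
        super().__init__()
        self.patch_proj = nn.Linear(vit_dim, lm_dim)
        self.html_embed = nn.Embedding(vocab_size, lm_dim)

    def forward(self, patch_features, html_token_ids):
        # patch_features: (batch, n_patches, vit_dim) from a vision transformer
        # html_token_ids: (batch, n_tokens) tokenized HTML
        visual = self.patch_proj(patch_features)
        textual = self.html_embed(html_token_ids)
        # The concatenated sequence is what the instruction-finetuned LM attends over.
        return torch.cat([visual, textual], dim=1)

fusion = ScreenshotHtmlFusion()
seq = fusion(torch.randn(2, 196, 768), torch.randint(0, 32000, (2, 512)))
print(seq.shape)  # (2, 196 + 512, 1024)
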
Fri 6:20 a.m. - 6:25 a.m. | Q&A
Hiroki Furuta · Ofir Nachum · Kuang-Huei Lee · Yutaka Matsuo · Shixiang Gu · Izzeddin Gur

Fri 6:25 a.m. - 6:35 a.m. | SemDeDup: Data-efficient learning at web-scale through semantic deduplication (Poster)

Progress in machine learning has been driven in large part by massive increases in data. However, large web-scale datasets such as LAION are largely uncurated beyond searches for exact duplicates, potentially leaving much redundancy. Here, we introduce SemDeDup, a method that leverages embeddings from pre-trained models to identify and remove "semantic duplicates": data pairs that are semantically similar but not exactly identical. Removing semantic duplicates preserves performance and speeds up learning. Analyzing a subset of LAION, we show that SemDeDup can remove 50% of the data with minimal performance loss, effectively halving training time; moreover, out-of-distribution performance increases. Analyzing language models trained on C4, a partially curated dataset, we also show that SemDeDup improves over prior approaches. SemDeDup is an example of how simple ways of leveraging quality embeddings can make models learn faster with less data.

Amro Kamal · Kushal Tirumala · Daniel Simig · Surya Ganguli · Ari Morcos

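Conceptually, the method is straightforward to sketch: embed every example with a pre-trained model, cluster the embeddings, and within each cluster keep only one member of any pair whose cosine similarity exceeds a threshold. The cluster count and threshold below are arbitrary illustrations, not the paper's settings:

import numpy as np
from sklearn.cluster import KMeans

def semantic_dedup(embeddings, n_clusters=10, threshold=0.95):
    """Return indices of examples to keep after removing near-duplicate embeddings."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(emb)
    keep = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        sims = emb[idx] @ emb[idx].T
        kept_local = []
        for i, global_i in enumerate(idx):
            # Drop this example if it is too similar to one already kept in the cluster.
            if all(sims[i, j] < threshold for j in kept_local):
                kept_local.append(i)
                keep.append(global_i)
    return np.array(sorted(keep))

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 64))
data[100:110] = data[0] + 1e-3 * rng.normal(size=(10, 64))  # inject near-duplicates
print(len(semantic_dedup(data)))  # fewer than 200 once duplicates are removed
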
Fri 6:35 a.m. - 6:40 a.m. | Q&A
Amro Kamal · Kushal Tirumala · Daniel Simig · Surya Ganguli · Ari Morcos

Fri 6:43 a.m. - 6:45 a.m. | Coffee Break

Fri 6:45 a.m. - 7:15 a.m. | Injecting large models with new modalities for Video Understanding (Invited Talk)

Large models have had an 'explosion' moment recently, achieving state-of-the-art results across various benchmarks and tasks. Here we discuss how they can be adapted to novel vision and audio inputs for multimodal tasks, either by influencing model design or by serving as frozen components in multimodal architectures. We focus on multimodal video captioning tasks such as ASR and automatic AD for movies, and cover some recently accepted papers at CVPR 2023.

Arsha Nagrani

Fri 7:20 a.m. - 7:50 a.m. | Towards Structured Multimodal Representations (Invited Talk)

Multimodal modelling has seen great interest in recent years, with fantastic results and applicability over a wide range of tasks. A particular feature of this applicability has been the development of conditional generation, and the chaining of such conditional models to generate cross-modally. This, however, has meant that the question of representations, and of what being cross-modal entails, has been eschewed in favour of high generative quality, leaving models as black boxes from the perspective of human inspection and interpretability. In this talk, I will touch upon some recent and ongoing work in our lab towards learning unsupervised models that capture structured representations, which can be constrained across modalities to address questions of interpretability through multimodal grounding.

Siddharth N

Fri 7:50 a.m. - 8:00 a.m. | Coffee Break

Fri 8:00 a.m. - 8:45 a.m. | The Perks and Pitfalls of MRL (Panel)
Arsha Nagrani · Luca Moschella · Paul Pu Liang · Siddharth N · Valentino Maiorca

Fri 8:45 a.m. - 9:00 a.m. | Closing Remarks (Closing)

Text-to-Image Diffusion Models are Zero-Shot Classifiers (Poster)

Text-to-image diffusion models have demonstrated remarkable generative capabilities, suggesting that they learn informative representations of image-text data. However, their abilities are not fully understood, and they have not been thoroughly explored on downstream tasks. We investigate diffusion models by proposing a method for evaluating them as zero-shot classifiers. The key idea is to use a diffusion model's ability to denoise a noised image, given a textual description of a label, as a proxy for that label's likelihood. We apply our method to Imagen, using it to probe fine-grained aspects of Imagen's knowledge and to compare it with CLIP's zero-shot abilities. Imagen performs competitively with CLIP on a wide range of zero-shot image classification datasets. Additionally, it is more robust than CLIP and can successfully perform attribute binding where CLIP cannot. Although generative pre-training is common in NLP, visual foundation models often use other methods such as contrastive learning. Based on our findings, we argue that generative pre-training should be explored as a compelling alternative for visual and vision-language problems.

Kevin Clark · Priyank Jaini
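
The evaluation recipe in the abstract can be sketched generically: for each candidate label, noise the image, ask the text-conditioned model to denoise, and score the label by the average denoising error (lower is better). The denoiser interface and noise schedule below are hypothetical stand-ins, not Imagen's API:

import torch

@torch.no_grad()
def diffusion_zero_shot_classify(denoiser, image, class_prompts, n_trials=8):
    """Pick the label whose text conditioning yields the lowest denoising error."""
    errors = []
    for prompt in class_prompts:
        trial_errors = []
        for _ in range(n_trials):
            t = torch.rand(1)                        # random noise level in (0, 1)
            noise = torch.randn_like(image)
            noisy = torch.sqrt(1 - t) * image + torch.sqrt(t) * noise
            pred_noise = denoiser(noisy, t, prompt)  # hypothetical denoiser interface
            trial_errors.append(((pred_noise - noise) ** 2).mean())
        errors.append(torch.stack(trial_errors).mean())
    return class_prompts[int(torch.stack(errors).argmin())]

# Toy stand-in denoiser so the sketch runs end to end.
fake_denoiser = lambda noisy, t, prompt: torch.zeros_like(noisy)
image = torch.randn(3, 64, 64)
print(diffusion_zero_shot_classify(fake_denoiser, image, ["a photo of a dog", "a photo of a cat"]))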