In the broader AI research community, Wikipedia data has been used for years as part of the training corpora for (multilingual) language models such as BERT. Its content, however, remains a largely untapped resource for vision and multimodal learning systems. Aside from a few recent exceptions, most vision-and-language efforts either target narrow domains and small vocabularies and/or are available for English only, limiting the diversity of perspectives and audiences these technologies can serve. Recently, methods leveraging large-scale data for multimodal pretraining have emerged, and Wikipedia is one of the few open resources central to that effort.

With this workshop, we offer a space that brings together the community of vision, language, and multilingual learning researchers, as well as members of the Wikimedia community, to discuss how these two groups can help and support each other. We will explore existing aspects and new frontiers of multilingual understanding of vision and language, focusing on the unique nature of Wikimedia's mission: to bring free knowledge to the whole world equally.

Besides invited talks and panel discussions, the workshop will present the winning entries of an ongoing Wikimedia-led, large-scale challenge on multilingual, multimodal image-text retrieval. Using the publicly available Wikipedia-based Image Text (WIT) dataset, which contains 37 million image-text sets across 108 languages, we will present the benchmark and the top methods along a disaggregated set of performance, fairness, and efficiency metrics.
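For readers who want to explore the challenge data themselves, the following is a minimal sketch of loading one shard of the public WIT release and checking its per-language coverage. The shard file name and the column names (`language`, `image_url`, `caption_reference_description`) are assumptions based on the public WIT distribution; verify them against the files you actually download.

```python
# Minimal sketch: inspect per-language coverage in a WIT TSV shard.
# The file name and column names below are assumptions about the public
# WIT release, not guaranteed -- check them against your downloaded files.
import pandas as pd

SHARD = "wit_v1.train.all-00000-of-00010.tsv.gz"  # hypothetical local path

df = pd.read_csv(SHARD, sep="\t", compression="gzip")

# Count image-text examples per language to see the multilingual spread.
per_language = df["language"].value_counts()
print(per_language.head(20))

# Peek at a few (image URL, reference caption) pairs for one language.
sample = df[df["language"] == "en"][["image_url", "caption_reference_description"]]
print(sample.head())
```
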
Fri 3:00 a.m. - 3:20 a.m. | Opening Remarks
Introducing the workshop

Fri 3:20 a.m. - 4:20 a.m. | Session: Open Data (Keynotes and Q&A)
20-minute talks by Omar Sanseviero and Hady Elsahar, followed by a joint Q&A
Speakers: Omar Sanseviero · Hady Elsahar

Fri 4:20 a.m. - 5:20 a.m. | Session: Multimodality and Multilinguality - 1 (Keynotes and Q&A)
20-minute keynotes by Lucia Specia and Preethi Jyothi
Speakers: Lucia Specia · Preethi Jyothi

Fri 5:20 a.m. - 6:00 a.m. | Ask a Wikipedian & Poster Session (Posters and breakout rooms)
Presenters: Isaac Johnson · Emily Lescak · Byungsoo Ko · Geonmo Gu · Nicola Messina · Davide Alessandro Coccomini · Fabrizio Falchi · Andrea Esuli

Fri 5:20 a.m. - 6:00 a.m. | Ask a Wikipedian group 2 (Posters)
Meeting spot 2 in GatherTown for the Ask a Wikipedian discussions

Fri 5:20 a.m. - 6:00 a.m. | Ask a Wikipedian group 3 (Posters)
Meeting spot 3 in GatherTown for the Ask a Wikipedian discussions

Fri 6:00 a.m. - 7:00 a.m. | Session: Wikimedia and the community (Keynotes and Q&A)
Keynotes by Leila Zia, Andrew Lih, and Caroline Becker
Speakers: Leila Zia · Andrew Lih · Caroline Becker

Fri 7:00 a.m. - 8:00 a.m. | Panel: Multilinguality in multimodal research and open data
Panelists: Preethi Jyothi, Michael Running Wolf, Jason Baldridge, Omar Sanseviero, and Margaret Mitchell. Moderated by Lucie Kaffee.

Fri 8:00 a.m. - 9:00 a.m. | Panel: How can Wikimedia and CV/ML communities learn from each other?
Panelists: Lucia Specia, Caroline Becker, Leila Zia, Marc Najork, and Hady Elsahar. Moderated by Andrew Lih.

Fri 9:00 a.m. - 10:00 a.m. | Session: Wikipedia Image/Caption Matching Competition (Live presentations and Q&A)
Presenters: Miriam Redi · Krishna Srinivasan · Zhao He · Peng Lu · miaou miaou · Fabrizio Falchi · Nicola Messina · Andrea Esuli · Davide Alessandro Coccomini

Fri 10:00 a.m. - 10:10 a.m. | Multimodality and large-scale vision (Talk)
Speaker: Tom Duerig

Fri 10:10 a.m. - 10:20 a.m. | Florence-VL overview (Talk)
Speaker: Lijuan Wang

Fri 10:20 a.m. - 10:30 a.m. | Multitask and Reliable Vision and Language Models (Talk)
Speaker: Marcus Rohrbach

Fri 10:30 a.m. - 11:00 a.m. | Secrets of large-scale vision and language model pre-training (Panel)
Panel and Q&A with Tom Duerig (Google), Lijuan Wang (Microsoft), and Marcus Rohrbach (Meta AI Research). Moderated by Yannis Kalantidis.

Fri 11:00 a.m. - 12:00 p.m. | Biases in AI and Indigenous data sovereignty (Keynotes and Q&A)
20-minute keynotes by Michael Running Wolf and Margaret Mitchell
Speakers: Michael Running Wolf · Margaret Mitchell

Fri 12:00 p.m. - 12:30 p.m. | Session: Multimodality and Multilinguality - 2 (Keynote and Q&A)
20-minute keynote by Jason Baldridge
Speaker: Jason Baldridge

Fri 12:30 p.m. - 12:35 p.m. | Transformer-Based Multi-modal Proposal and Re-Rank for Wikipedia Image-Caption Matching (Oral)
Abstract: With the increased accessibility of the web and online encyclopedias, the amount of data to manage is constantly growing. In Wikipedia, for example, there are millions of pages written in multiple languages. These pages contain images that often lack textual context, remaining conceptually floating and therefore harder to find and manage. In this work, we present the system we designed for participating in the Wikipedia Image-Caption Matching challenge on Kaggle, whose objective is to use data associated with images (URLs and visual data) to find the correct caption among a large pool of available ones. A system able to perform this task would improve the accessibility and completeness of multimedia content on large online encyclopedias. Specifically, we propose a cascade of two models, both powered by the recent Transformer architecture, able to efficiently and effectively infer a relevance score between the query image data and the captions. We verify through extensive experimentation that the proposed two-model approach is an effective way to handle a large pool of images and captions while keeping the overall computational complexity at inference time bounded. Our approach achieves remarkable results, obtaining a normalized Discounted Cumulative Gain (nDCG) value of 0.53 on the private leaderboard of the Kaggle challenge.
Authors: Nicola Messina · Davide Alessandro Coccomini · Fabrizio Falchi · Andrea Esuli

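The "propose then re-rank" idea described in this abstract can be illustrated with a minimal sketch: a cheap similarity search over the full caption pool proposes candidates, and a heavier joint scorer re-ranks only those candidates. The `joint_scorer` callable below is a placeholder, not the authors' model.

```python
# Minimal sketch of a two-stage "propose then re-rank" retrieval cascade.
# Assumes precomputed embeddings and a placeholder joint (cross-attention)
# scorer; this is an illustration, not the authors' implementation.
import torch

def cascade_rank(query_feats, caption_feats, joint_scorer, k=100):
    """query_feats: (D,) image-side embedding; caption_feats: (N, D)."""
    # Stage 1: cheap cosine similarity over the full caption pool.
    sims = torch.nn.functional.cosine_similarity(
        query_feats.unsqueeze(0), caption_feats, dim=-1
    )
    _, top_idx = sims.topk(k)

    # Stage 2: expensive joint scoring, but only over the k proposals.
    rerank_scores = joint_scorer(query_feats, caption_feats[top_idx])  # (k,)

    # Final ranking = proposals ordered by the re-ranker's scores.
    order = rerank_scores.argsort(descending=True)
    return top_idx[order]
```

The design point is that the expensive model never sees the full pool, so inference cost stays bounded by k rather than by the number of captions.
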
Fri 12:35 p.m. - 12:40 p.m. | Large-scale Bilingual Language-Image Contrastive Learning (Oral)
Abstract: This paper is a technical report sharing our experience and findings from building a Korean and English bilingual multimodal model. While many multimodal datasets focus on English, and multilingual multimodal research often relies on machine-translated texts, such texts are limited in describing unique expressions, cultural information, and proper nouns in languages other than English. In this work, we collect 1.1 billion image-text pairs (708 million Korean and 476 million English) and train a bilingual multimodal model named KELIP. We introduce simple yet effective training schemes, including MAE pre-training and multi-crop augmentation. Extensive experiments demonstrate that a model trained with these schemes shows competitive performance in both languages. Moreover, we discuss multimodal-related research questions: 1) strong augmentation-based methods can distract the model from learning proper multimodal relations; 2) a multimodal model trained without explicit cross-lingual relations can still learn them via visual semantics; 3) our bilingual KELIP can capture cultural differences in visual semantics for the same word meanings; 4) a large-scale multimodal model can be used for multimodal feature analogy. We hope that this work will provide helpful experience and findings for future research. We provide an open-source pre-trained KELIP.
Authors: Byungsoo Ko · Geonmo Gu

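As context for this talk, the following is a minimal sketch of the symmetric image-text contrastive (InfoNCE) objective that CLIP-style bilingual models such as KELIP build on. The encoders and batching are placeholders; this is not the authors' training code.

```python
# Minimal sketch of a symmetric image-text contrastive (InfoNCE) loss,
# the standard objective behind CLIP-style models. Encoders and data
# loading are placeholders; this is an illustration, not KELIP itself.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (B, D) embeddings for B matched image-text
    pairs; captions in a bilingual setup may be Korean or English."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (B, B) similarity matrix; diagonal entries are the matched pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: image-to-text and text-to-image retrieval.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```
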
Fri 12:40 p.m. - 12:45 p.m. | Considerations for Multilingual Wikipedia Research (Oral)
Abstract: English Wikipedia has long been an important data source for research and natural language machine learning modeling. The growth of non-English language editions of Wikipedia, greater computational resources, and calls for equity in the performance of language and multimodal models have led to the inclusion of many more language editions of Wikipedia in datasets and models. Building better multilingual and multimodal models requires more than just access to expanded datasets; it also requires a better understanding of what is in the data and how this content was generated. This paper seeks to provide some background to help researchers think about what differences might arise between different language editions of Wikipedia and how that might affect their models. It details three major ways in which content differences between language editions arise (local context, community and governance, and technology) and offers recommendations for good practices when using multilingual and multimodal data for research and modeling.
Authors: Isaac Johnson · Emily Lescak

Fri 12:45 p.m. - 1:00 p.m. | Papers Q&A (Q&A)

Fri 1:00 p.m. - 1:20 p.m. | Closing Remarks