WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning

by Krishna Srinivasan et al.

The milestone improvements brought about by deep representation learning and pre-training techniques have led to large performance gains across downstream NLP, IR, and Vision tasks. Multimodal modeling techniques aim to leverage large, high-quality visio-linguistic datasets to learn complementary information across the image and text modalities. In this paper, we introduce the Wikipedia-based Image Text (WIT) Dataset (https://github.com/google-research-datasets/wit) to facilitate multimodal, multilingual learning. WIT is composed of a curated set of 37.6 million entity-rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pretraining dataset for multimodal models, as we show when applying such models to downstream tasks such as image-text retrieval. WIT has four main and unique advantages. First, WIT is the largest multimodal dataset by number of image-text examples, by a factor of 3x (at the time of writing). Second, WIT is massively multilingual (the first of its kind), covering 100+ languages (each with at least 12K examples) and providing cross-lingual texts for many images. Third, WIT represents a more diverse set of concepts and real-world entities than previous datasets cover. Lastly, WIT provides a very challenging real-world test set, as we empirically illustrate using an image-text retrieval task as an example.
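As a rough illustration of how a multilingual image-text dataset like WIT can be consumed, the sketch below counts examples per language from tab-separated rows. The field names (`language`, `image_url`, `caption_reference_description`) and the inline sample rows are assumptions for illustration only, not verified against the actual WIT release format.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows in a WIT-style tab-separated format.
# The real dataset ships as compressed TSV shards; the column names
# used here are assumptions, not the verified release schema.
SAMPLE_TSV = (
    "language\timage_url\tcaption_reference_description\n"
    "en\thttps://upload.wikimedia.org/a.jpg\tA red fox in snow\n"
    "de\thttps://upload.wikimedia.org/a.jpg\tEin Rotfuchs im Schnee\n"
    "en\thttps://upload.wikimedia.org/b.jpg\tThe Eiffel Tower at night\n"
)

def examples_per_language(tsv_text: str) -> Counter:
    """Count image-text examples per Wikipedia language code."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return Counter(row["language"] for row in reader)

counts = examples_per_language(SAMPLE_TSV)
print(counts)  # Counter({'en': 2, 'de': 1})
```

Note that the same image URL can appear under several languages, which is what gives WIT its cross-lingual texts for many images.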

