Multitask Multilingual Multimodal Pre-training
This paper presents a Multitask Multilingual Multimodal Pre-trained model (M3P) that combines multilingual-monomodal pre-training and monolingual-multimodal pre-training into a unified framework via multitask learning and weight sharing. The model learns universal representations that can map objects occurring in different modalities or expressed in different languages to vectors in a common semantic space. To verify the generalization capability of M3P, we fine-tune the pre-trained model on different types of downstream tasks: multilingual image-text retrieval, multilingual image captioning, multimodal machine translation, multilingual natural language inference and multilingual text generation. Evaluation shows that M3P can (i) achieve comparable results on multilingual tasks and English multimodal tasks, compared to the state-of-the-art models pre-trained for these two types of tasks separately, and (ii) obtain new state-of-the-art results on non-English multimodal tasks in the zero-shot or few-shot setting. We also build a new Multilingual Image-Language Dataset (MILD) by collecting large amounts of (text-query, image, context) triplets in 8 languages from the logs of a commercial search engine.
Recently, we have witnessed the rise of a new paradigm of natural language processing (NLP), where general knowledge is learned from raw texts by self-supervised pre-training and then applied to downstream tasks by task-specific fine-tuning. These state-of-the-art monolingual pre-trained language models, such as BERT, have since been extended to multilingual scenarios, such as Multilingual BERT, XLM/XLM-R [2, 3] and Unicoder, and to multimodal scenarios, such as ViLBERT, Unicoder-VL, UNITER, VLP and Oscar. However, it is still challenging to extend these pre-trained models to multilingual-multimodal scenarios due to the lack of large amounts of aligned multimodal corpora in multiple languages for multilingual-multimodal pre-training. As a result, many multilingual pre-trained models cannot handle vision data (e.g. images and videos), whereas many multimodal pre-trained models, which are trained on texts mainly in English, cannot handle multiple languages.
To address this challenge, this paper presents a Multitask Multilingual Multimodal Pre-trained model (M3P), which aims to learn universal representations that can map objects occurring in different modalities or expressed in different languages to vectors in a common semantic space. This goal is achieved by (i) learning to represent multilingual data using multilingual corpora (i.e. sentences from Wikipedia covering 100 languages) via multilingual-monomodal pre-training, (ii) learning to represent multimodal data using multimodal corpora (i.e. image-caption pairs labeled in English) via monolingual-multimodal pre-training, and (iii) generalizing these representations to multilingual-multimodal tasks via multitask learning and weight sharing.
To verify the generalization capability of M3P, we fine-tune the pre-trained model on different types of downstream tasks: multilingual image-text retrieval, multilingual image captioning, multimodal machine translation, multilingual natural language inference and multilingual text generation. Evaluation shows that M3P (i) achieves comparable results on multilingual tasks and English multimodal tasks, compared to the state-of-the-art models pre-trained for these two types of tasks separately, and (ii) obtains new state-of-the-art results on non-English multimodal tasks in the zero-shot or few-shot setting. To further evaluate the learned multilingual multimodal representations in more languages, we also build a new Multilingual Image-Language Dataset (MILD), which includes (text-query, image, context) triplets in 8 languages, collected from the logs of a commercial search engine. Different from other widely-used image-language datasets such as MSCOCO and Flickr30K, the texts in MILD are shorter and contain more entities, which makes image-language tasks defined on this dataset (such as image-text retrieval) much more challenging. We will release MILD as a new benchmark to facilitate multilingual multimodal research.
Multilingual BERT (M-BERT)  demonstrates that by performing masked language modeling on a multilingual corpus with shared vocabulary and weights for 102 languages, surprisingly good results can be achieved on the cross-lingual natural language inference (XNLI)  task in 15 languages. XLM  and Unicoder  further improve the multilingual BERT by introducing new pre-training tasks based on a bilingual corpus. XLM-R  shows that by performing masked language modeling on a large-scale multilingual corpus, new state-of-the-art results on XNLI, MLQA and NER can be obtained. mBART  and Unicoder described in XGLUE  extend the multilingual models to multilingual text generation tasks based on the encoder-decoder framework and use different denoising auto-encoding pre-training tasks. However, all such models work for NLP tasks only, and cannot be applied to multimodal tasks such as image captioning.
Recently, a large number of multimodal pre-trained models, such as ViLBERT , Unicoder-VL , UNITER , VLP  and Oscar , are developed for vision-language tasks using multi-layer Transformer as the backbone. These models are pre-trained using similar visual-linguistic tasks and achieve comparable results on many vision-language tasks, such as visual question answering, visual commonsense reasoning, image-text retrieval and image captioning. However, as it is not easy to collect well-aligned visual-linguistic training data in multiple languages, all these models are pre-trained for English only based on monolingual multimodal corpora, such as Conceptual Captions , SBU Captions , Visual Genome  and MSCOCO , and cannot be applied to multimodal tasks with non-English inputs.
Multimodal machine translation is a task that involves multilingual and multimodal factors at the same time. Prior work proposes a multitask-learning-based method that learns a multimodal translation model while linking visual semantics with the corresponding textual semantics. Another line of work proposes a multimodal simultaneous neural machine translation method, which leverages visual information as an additional input and verifies its importance for simultaneous translation. However, due to the low-resource issue, these models are usually trained on very small amounts of (image, source caption, target caption translation) triples.
This section describes how to train M3P using a multilingual-monomodal corpus (e.g. sentences extracted from Wikipedia) and a monolingual-multimodal corpus (e.g. English image-caption pairs). M3P uses the model architecture of BERT for understanding tasks and a BERT-based encoder-decoder architecture for generation tasks. We pre-train M3P via multitask learning over a set of understanding and generation tasks, as shown in Figure 1.
Given an input image, we obtain its image region sequence $v = \{v_1, \ldots, v_M\}$ using Faster-RCNN, where $v_i$ denotes the $i$-th image region and $M$ denotes the length of $v$. The region embedding of $v_i$ is the visual feature output by Faster-RCNN. The spatial embedding of $v_i$ is a 5-D vector based on its normalized top-left and bottom-right coordinates and the fraction of the image area covered. We project these two embeddings into the text embedding space using two fully-connected (FC) layers. The final input representation of each image region is obtained by summing its projected region embedding and spatial embedding. We also keep the most probable object category of each image region predicted by Faster-RCNN, which will be used in the pre-training procedure.
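As an illustration, the 5-D spatial embedding described above can be computed from a region's bounding box as follows (a minimal sketch; the function name and box format are ours, not from the paper):

```python
import numpy as np

def spatial_embedding(box, img_w, img_h):
    """Build the 5-D spatial feature for one detected region.

    `box` is (x1, y1, x2, y2) in pixels; the output is the normalized
    top-left and bottom-right coordinates plus the fraction of the
    image area the region covers, as described in the text.
    """
    x1, y1, x2, y2 = box
    area_frac = ((x2 - x1) * (y2 - y1)) / (img_w * img_h)
    return np.array([x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h, area_frac])
```

This 5-D vector is then projected into the text embedding space by an FC layer and summed with the projected region embedding.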
Given an input text in a language $l$ from the language set $L$, we obtain its BPE token sequence $w = \{w_1, \ldots, w_N\}$ using SentencePiece, where $w_i$ denotes the $i$-th BPE token and $N$ denotes the length of $w$. The final input representation of each BPE token is obtained by summing its token embedding and position embedding. Moreover, a language embedding is added to each input token to indicate different languages during generation. We use the same vocabulary as XLM-R, which includes 250K BPE tokens and covers 100 languages.
Multilingual Masked Language Modeling (xMLM). This task performs masked language modeling on the multilingual corpus. At each iteration, a batch is composed of sentences sampled from different languages. The sampling probability of a language $l$ is defined as $p_l \propto q_l^{\alpha}$, where $q_l$ is the percentage of $l$ in the entire multilingual corpus and the smoothing factor $\alpha$ is set to 0.3. For each batch, we randomly sample 15% of the words and (i) replace them with a special symbol [MASK] with probability 80%, (ii) replace them with a random token with probability 10%, or (iii) keep them unchanged with probability 10%. A bilingual corpus can be used to further improve multilingual pre-training [2, 4], but this paper uses a multilingual corpus only, as it is nontrivial to collect a bilingual corpus for 100 languages.
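The smoothed language-sampling distribution can be sketched as follows (a hypothetical helper; only the exponent $\alpha = 0.3$ comes from the text):

```python
def sampling_probs(lang_counts, alpha=0.3):
    """Exponentiated-and-renormalized language sampling distribution.

    p_l is proportional to q_l ** alpha, where q_l is the share of
    language l in the corpus; alpha < 1 up-weights low-resource
    languages relative to their raw frequency.
    """
    total = sum(lang_counts.values())
    q = {lang: count / total for lang, count in lang_counts.items()}
    z = sum(share ** alpha for share in q.values())
    return {lang: (share ** alpha) / z for lang, share in q.items()}
```

For example, with a 90/10 split between a high- and a low-resource language, the low-resource language receives well over 10% of the sampled sentences.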
Multimodal Masked Language Modeling (MMLM). We follow the same masking strategy used in xMLM to mask tokens in the input caption. The loss function is defined as:

$$\mathcal{L}_{MMLM} = -\mathbb{E}_{(w^{en}, v) \sim D}\, \log P\big(w_m \mid w_{\setminus m}, v\big),$$

where $D$ denotes the set of image-caption pairs, the superscript $en$ denotes that the input caption is in English, and $w_m$ and $w_{\setminus m}$ denote the masked tokens and the remaining tokens, respectively.
Masked Region Modeling (MRM). This task aims to reconstruct each masked image region based on the remaining regions $v_{\setminus m}$ and all the caption tokens $w$. We randomly mask image regions with a probability of 15%. The input representation of each masked image region is set to zeros or keeps its original values with probability 90% and 10%, respectively. The loss function is defined as:

$$\mathcal{L}_{MRM} = \mathbb{E}_{(w^{en}, v) \sim D} \sum_{i \in m} \Big[ \mathcal{L}_{regr}\big(h_{v_i}, f_{v_i}\big) + \mathcal{L}_{cls}\big(h_{v_i}, c_{v_i}\big) \Big],$$

where $m$ enumerates the indices of all masked image regions. $\mathcal{L}_{regr}$ denotes the mean-square-error loss that tries to regress the Transformer output $h_{v_i}$ of each masked region $v_i$ to its visual feature $f_{v_i}$; we apply an FC layer to convert $h_{v_i}$ into a vector of the same dimension as $f_{v_i}$. $\mathcal{L}_{cls}$ denotes the cross-entropy loss that tries to predict the object category of each masked region $v_i$; we apply another FC layer to convert $h_{v_i}$ into scores over the object classes, which further go through a softmax function to form a normalized distribution. We take the predicted object category with the highest confidence score output by Faster-RCNN as the ground-truth label of $v_i$, and convert it into a one-hot vector $c_{v_i}$. Because the top-1 category predicted by Faster-RCNN is not always correct, we leave minimizing the KL divergence between the two distributions for future work.
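The region-masking scheme above (15% of regions selected; a selected region's features zeroed with probability 90% and kept with probability 10%) can be sketched as follows (an illustrative helper, not the paper's implementation):

```python
import numpy as np

def mask_regions(region_feats, mask_prob=0.15, zero_prob=0.9, rng=None):
    """Randomly mask image-region features for masked region modeling.

    Each region is selected with probability `mask_prob`; a selected
    region's feature vector is zeroed with probability `zero_prob` and
    kept unchanged otherwise. Returns the corrupted copy of the
    features and the boolean mask of selected regions.
    """
    rng = rng or np.random.default_rng(0)
    feats = region_feats.copy()
    masked = rng.random(len(feats)) < mask_prob
    for i in np.where(masked)[0]:
        if rng.random() < zero_prob:
            feats[i] = 0.0  # zero out the whole feature vector
    return feats, masked
```

The returned mask identifies which regions the model must reconstruct.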
Visual-Linguistic Matching (VLM). This task aims to learn the instance-level alignment between texts and images. An FC layer is applied to the Transformer output of [CLS] to predict whether the input image $v$ and the input text $w$ are semantically matched. Negative image-caption pairs are created by replacing the image or text in a matched sample with a randomly selected image or text from other samples. The loss function is defined as:

$$\mathcal{L}_{VLM} = -\mathbb{E}_{(w^{en}, v) \sim D}\, \log P\big(y \mid v, w^{en}\big),$$

where $y \in \{0, 1\}$ indicates whether the input image-text pair is matched or not.
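Negative-pair construction for VLM can be sketched as follows (a minimal illustration; the helper name and the number of negatives per positive are our choices):

```python
import random

def make_vlm_pairs(images, captions, num_neg=1, rng=None):
    """Build matched and mismatched (image, caption, label) examples.

    Positives pair each image with its own caption (label 1); negatives
    swap in a randomly chosen caption from another sample (label 0).
    Assumes images[i] and captions[i] are aligned.
    """
    rng = rng or random.Random(0)
    pairs = []
    n = len(images)
    for i in range(n):
        pairs.append((images[i], captions[i], 1))
        for _ in range(num_neg):
            j = rng.choice([k for k in range(n) if k != i])
            pairs.append((images[i], captions[j], 0))
    return pairs
```

Symmetric negatives (replacing the image instead of the caption) can be generated the same way.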
Multilingual Denoising Auto-Encoding (xDAE). This task aims to predict the original BPE token sequence $w$ based on its corrupted form $g(w)$, where $g$ is a noising function that corrupts $w$ by performing the following three operations sequentially: (1) shuffle $w$ by adding a noise to the input indices and then re-ordering tokens based on the rank of the noised indices; (2) drop words with a probability of 30%; (3) sample a number of token spans from $w$ with span lengths drawn from a Poisson distribution, and then replace each token span with a single [MASK] token. Here, 0-length spans correspond to the insertion of [MASK] tokens. The loss function is defined as:

$$\mathcal{L}_{xDAE} = -\mathbb{E}_{w \sim C} \sum_{t} \log P\big(w_t \mid w_{<t}, g(w)\big),$$

where $C$ denotes the multilingual corpus and $w_{<t}$ denotes the token sequence already generated before time step $t$.
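The three-step noising function $g$ can be sketched as follows (the shuffle-noise magnitude and Poisson rate are assumed values, as the text does not specify them, and a single masked span stands in for the general multi-span case):

```python
import numpy as np

def noise_tokens(tokens, shuffle_noise=3.0, drop_prob=0.3, span_lam=3.0, seed=0):
    """Corrupt a token sequence with the three operations from the text:
    local shuffle, word dropout, and span masking.
    """
    rng = np.random.default_rng(seed)
    # 1) shuffle: sort by original index plus uniform noise
    keys = np.arange(len(tokens)) + rng.uniform(0, shuffle_noise, len(tokens))
    toks = [tokens[i] for i in np.argsort(keys)]
    # 2) word dropout with probability 30%
    toks = [t for t in toks if rng.random() >= drop_prob]
    # 3) replace one Poisson-length span with a single [MASK]
    #    (a 0-length span corresponds to inserting a [MASK] token)
    span = int(min(rng.poisson(span_lam), len(toks)))
    start = int(rng.integers(0, len(toks) - span + 1))
    return toks[:start] + ["[MASK]"] + toks[start + span:]
```

The decoder is then trained to reconstruct the original sequence from this corrupted input.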
Image Captioning (IC). This task aims to generate the caption $w^{en}$ based on the image region sequence $v$ detected from the input image. The loss function is defined as:

$$\mathcal{L}_{IC} = -\mathbb{E}_{(w^{en}, v) \sim D} \sum_{t} \log P\big(w^{en}_t \mid w^{en}_{<t}, v\big).$$
Denoising Image Captioning (DIC). Given the image region sequence $v$ detected from an input image, this task aims to generate the caption of the input image based on $g_v(v)$, where $g_v$ is a noising function that corrupts $v$ by sampling n-gram regions from $v$ and then replacing each n-gram region with a zero-initialized vector. The span lengths are drawn from a Poisson distribution. The loss function is defined as:

$$\mathcal{L}_{DIC} = -\mathbb{E}_{(w^{en}, v) \sim D} \sum_{t} \log P\big(w^{en}_t \mid w^{en}_{<t}, g_v(v)\big),$$

where $w^{en}_{<t}$ denotes the token sequence already generated before the time step $t$.
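The region-corruption function $g_v$ can be sketched similarly (the Poisson rate and the number of corrupted spans are assumed values, not from the paper):

```python
import numpy as np

def corrupt_regions(region_feats, span_lam=3.0, num_spans=1, seed=0):
    """Zero out contiguous n-gram region spans for denoising image captioning.

    Samples `num_spans` spans with Poisson-distributed lengths and
    replaces each span of region features with zero-initialized vectors.
    Returns a corrupted copy; the input array is left unchanged.
    """
    rng = np.random.default_rng(seed)
    feats = region_feats.copy()
    n = len(feats)
    for _ in range(num_spans):
        span = int(min(rng.poisson(span_lam), n))
        if span == 0:
            continue  # 0-length span: nothing to zero out
        start = int(rng.integers(0, n - span + 1))
        feats[start:start + span] = 0.0
    return feats
```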
To further evaluate the learned multilingual multimodal representations in more languages, we build MILD as a new multilingual image-text dataset covering 8 languages: English (en), German (de), French (fr), Portuguese (pt), Spanish (es), Italian (it), Japanese (ja) and Chinese (zh). The dataset also includes a context for each image, which allows us to evaluate our model both with and without the context present. We construct the dataset in 5 steps.
Step-1: We collect billions of image-text pairs from the logs of a commercial image search engine. Each text is a user query in one of the eight languages (en, de, fr, pt, es, it, ja, zh). Each image was clicked in response to a user query.
Step-2: We perform image-based filtering by (i) discarding low-quality images whose width or height is smaller than 300 pixels; (ii) discarding sensitive images with pornographic or racy content; (iii) applying a binary classifier to filter out images whose features cannot be reliably extracted.
Step-3: We perform text-based filtering by (i) discarding sensitive queries with pornographic or racy intent; (ii) using heuristic rules to remove queries with noisy words or numbers; (iii) discarding short queries with fewer than 5 words.
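The text-filtering step can be sketched as follows (the heuristics shown are illustrative stand-ins for the in-house rules; only the 5-word minimum comes from the text):

```python
import re

def keep_query(query, min_words=5):
    """Heuristic query filter in the spirit of Step-3.

    Drops queries shorter than `min_words` words and queries that are
    dominated by purely numeric or symbolic tokens.
    """
    words = query.split()
    if len(words) < min_words:
        return False
    # count tokens containing no letter at all (numbers, punctuation)
    noisy = sum(1 for w in words if not re.search(r"[^\W\d_]", w))
    return noisy / len(words) < 0.5
```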
Step-4: We use an in-house image-text semantic model to predict a relevance score for each query-image pair. This semantic model is trained on millions of human-labeled instances using text features, image features and image-text similarity features. Based on the relevance scores, we keep at most 5 queries for each image, following MSCOCO and Flickr30k. We also include the original title of each image as its context information, which is extracted from the HTML of the web page where the image comes from.
Step-5: We sample a portion of (query (Q), image (I), context (C)) triples generated in Step-4 to form MILD. Table 1 shows the statistics of MILD.
MILD differs from existing image-text benchmarks in three aspects: (1) The average query length in MILD is 5.8, shorter than 10.6 in MSCOCO and 12.3 in Flickr30K. This makes the image-text retrieval task on MILD harder, as the text caption is often too brief to describe all elements occurring in the ground-truth image; (2) A portion of the captions in MILD contain named entities such as person, location and organization names. For example, 39.2% of English queries contain entities (PER, LOC, ORG, DATE, PROD, EVENT or ZIP), and the number for English contexts is 54.6%. This leaves substantial room for future models to improve on this dataset by introducing new mechanisms to handle these entities; (3) Each image has an additional context text, extracted from the web page from which the image comes. Based on human evaluation on sampled image-query pairs, 80% of the pairs in MILD are matched pairs, in that the query is a plausible caption of its paired image. Figure 2 gives some examples from MILD.
We use raw sentences extracted from the Wikipedia dump as the multilingual corpus for multilingual monomodal pre-training. It includes 101G sentences covering 100 languages. We use Conceptual Captions  as the multimodal corpus for monolingual multimodal pre-training. It contains 3.3 million English image-caption pairs harvested from the Web.
For understanding tasks, we set the hyper-parameters as follows: 768 hidden units, 12 heads, GELU activation, a dropout rate of 0.1, a maximum input length of 128, and 12 encoder layers. In the pre-training stage, we initialize M3P with XLM-R and continue pre-training with xMLM, MMLM, MRM and VLM. We use the Adam optimizer with a linear warm-up and set the learning rate to 1e-4. The total batch size is 1,024 after gradient accumulation. The pre-training stage takes about 4 days to converge on 8x V100 GPUs. In the fine-tuning stage, the batch size is set to 512 and we sample 3 negative cases in VLM, using the Adam optimizer with a learning rate of 5e-5.
For generation tasks, we use the encoder-decoder architecture with 768 hidden units, 8 heads, GELU activation, a dropout rate of 0.1, a maximum input length of 128, and 12 layers in both the encoder and the decoder. The Transformer parameters are shared between the encoder and decoder, including the embedding and self-attention modules. In the pre-training stage, we train M3P with xDAE, IC and DIC. The batch size is 1,536 with gradient accumulation, and the initial learning rate is 1e-4 with a linear warm-up. In the fine-tuning stage, we reduce the learning rate to 5e-5 with a total batch size of 512. We feed the same language ID into the encoder and decoder, except for multimodal machine translation. We set the beam size to 10 for caption inference.
Table 2: mean Recall (mR) on Multi30K (en, de, fr, cs) and MSCOCO (en, ja, zh).

| Model | en (Multi30K) | de | fr | cs | en (MSCOCO) | ja | zh |
|---|---|---|---|---|---|---|---|
| *Results without pre-training* | | | | | | | |
| PAR. EmbN | 69.0 | 62.6 | 60.6 | 54.1 | 78.3 | 76.0 | 74.8 |
| *Results with monolingual multimodal pre-training* | | | | | | | |
| Unicoder-VL (w/o fine-tune) | 72.0 | - | - | - | 63.7 | - | - |
| Unicoder-VL (w/ fine-tune on en) | 88.1 | - | - | - | 89.2 | - | - |
| M3P (w/o fine-tune) | 61.1 | 35.7 | 24.7 | 26.4 | 62.1 | 32.1 | 33.3 |
| M3P (w/ fine-tune on en) | 86.0 | 48.8 | 39.4 | 38.8 | 87.4 | 54.4 | 55.8 |
| M3P (w/ fine-tune on each) | 86.0 | 80.2 | 67.1 | 66.2 | 87.4 | 83.9 | 77.4 |
| M3P (w/ fine-tune on all) | 86.7 | 82.0 | 73.5 | 70.2 | 88.0 | 86.8 | 81.8 |
Table 3: mean Recall (mR) on MILD.

| Model | en | de | fr | pt | es | it | ja | zh | avg |
|---|---|---|---|---|---|---|---|---|---|
| *Results based on <Q,I> pairs* | | | | | | | | | |
| M3P (w/ fine-tune on en) | 19.0 | 6.1 | 5.7 | 5.3 | 4.5 | 5.0 | 13.5 | 3.3 | 7.8 |
| M3P (w/ fine-tune on each) | 19.0 | 7.7 | 7.7 | 9.8 | 7.7 | 8.1 | 19.0 | 11.3 | 11.3 |
| M3P (w/ fine-tune on all) | 19.6 | 7.8 | 7.8 | 9.1 | 7.6 | 8.2 | 19.8 | 11.2 | 11.4 |
| *Results based on <Q,I,C> triples* | | | | | | | | | |
| M3P (w/ fine-tune on en) | 81.6 | 51.0 | 52.8 | 47.7 | 47.4 | 47.8 | 73.0 | 50.4 | 56.5 |
| M3P (w/ fine-tune on each) | 81.6 | 54.5 | 56.6 | 52.7 | 52.3 | 51.4 | 75.4 | 58.2 | 60.3 |
| M3P (w/ fine-tune on all) | 81.7 | 54.6 | 56.7 | 52.8 | 53.2 | 52.0 | 77.3 | 58.6 | 60.9 |
The task of multilingual image-text retrieval is to find the most relevant images given input texts in different languages, or vice versa. We evaluate M3P on Multi30K [28, 29], MSCOCO [16, 30, 31] and MILD. Multi30K extends Flickr30K to German (de), French (fr) and Czech (cs). It contains 31,783 images and provides 5 captions per image in English and German and 1 caption per image in French and Czech. We use the standard train, dev and test splits. MSCOCO contains 123,287 images and provides 5 captions per image in English, but fewer in Chinese and Japanese. STAIR Captions extends MSCOCO with 820K Japanese captions for COCO images, and COCO-CN extends MSCOCO with Chinese captions for 20K images. We use the same train, dev and test splits for English and Japanese as in prior work; for Chinese, we use the COCO-CN split. We use mean Recall (mR) as the metric, which is the average of Recall@1, Recall@5 and Recall@10 over the image-to-text and text-to-image retrieval tasks.
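Computing mR from the six recall scores is straightforward (a minimal helper for illustration):

```python
def mean_recall(i2t_recalls, t2i_recalls):
    """mean Recall (mR): average of R@1/R@5/R@10 over both directions.

    Each argument is a dict like {1: ..., 5: ..., 10: ...} holding
    recall percentages for image-to-text and text-to-image retrieval.
    """
    vals = [i2t_recalls[k] for k in (1, 5, 10)]
    vals += [t2i_recalls[k] for k in (1, 5, 10)]
    return sum(vals) / len(vals)
```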
Table 2 shows the evaluation results on Multi30K and MSCOCO, where M3P achieves state-of-the-art results compared to several related works [26, 23, 24, 25, 27]. We study the impacts of different fine-tuning strategies: w/o fine-tune applies M3P to all test sets directly without fine-tuning; w/ fine-tune on en fine-tunes M3P on English and then applies the fine-tuned model to all test sets; w/ fine-tune on each fine-tunes M3P on each language separately and applies each fine-tuned model to the test set of that language; w/ fine-tune on all fine-tunes M3P on the merged labeled data of all languages and applies the fine-tuned model to all test sets. Similar to the observations reported in Unicoder [4, 12], the last two fine-tuning strategies lead to the best results; the same sentence in different languages may capture complementary information that helps improve performance. We also compare with Unicoder-VL, which is pre-trained using the same image-caption corpus (i.e. Conceptual Captions), but for English only. Although M3P performs a bit worse than Unicoder-VL on English, it obtains comparable results on all the other languages, which verifies its strong transfer capability. A possible reason is the use of the xMLM task and a larger vocabulary covering 100 languages. In particular, SMALR takes advantage of machine translation to augment Multi30K and MSCOCO. Considering that translating English into all other supported languages is not general and requires a large number of translators, we leave this as an option for future work.
Table 3 shows the evaluation results on MILD. The first batch of results is based on Q-I pairs without using image contexts. Compared to the results on Multi30K and MSCOCO, the numbers on MILD are much lower, which shows it is a harder dataset. The second batch of results is based on Q-I-C triples, where each image and its context always appear together as input. The results show that such context information helps substantially on the image-text retrieval tasks in MILD.
Table 4: Image captioning results (B@4/CIDEr) on Multi30K (en, de, fr, cs) and MSCOCO (en, ja, zh).

| Model | en (Multi30K) | de | fr | cs | en (MSCOCO) | ja | zh |
|---|---|---|---|---|---|---|---|
| VLP (w/ fine-tune on en) | 30.1/67.4 | -/- | -/- | -/- | 36.5/116.9 | -/- | -/- |
| XGPT (w/ fine-tune on en) | 31.8/70.9 | -/- | -/- | -/- | 37.2/120.1 | -/- | -/- |
| M3P (w/ fine-tune on each) | 26.1/57.2 | 16.1/43.8 | 7.5/36.1 | 4.0/28.5 | 33.7/111.5 | 40.2/105.1 | 39.7/109.2 |
| M3P (w/ fine-tune on all) | 26.5/59.4 | 16.6/44.3 | 8.7/40.1 | 5.4/31.1 | 33.9/112.3 | 40.9/109.7 | 40.2/111.3 |
The task of multilingual image captioning is to generate captions in specific languages given input images. We evaluate M3P on Multi30K and MSCOCO, using BLEU-4 (B@4) and CIDEr (C) as the metrics. Table 4 shows the evaluation results. Similar to Table 2, M3P still performs worse than state-of-the-art pre-trained models (VLP and XGPT) on the English image captioning datasets, even though they employ the same image-caption corpus for pre-training. However, it shows a strong cross-lingual transfer capability on non-English datasets in the few-shot settings (i.e., w/ fine-tune on each and w/ fine-tune on all).
| Text-Only NMT | 53.5 | - | 31.6 | - |
The task of multimodal machine translation is to generate sentences in target languages given source sentences together with related images as complementary information. We evaluate M3P on Multi30K and use BLEU-4 (B@4) as the metric. We experiment with our model in four translation directions covering 3 languages: English (en), German (de) and French (fr); all language pairs include en on one of the sides. In Table 5, we compare the performance of M3P against state-of-the-art multimodal machine translation approaches and the text-only baseline. We observe that pre-training provides a significant boost in the BLEU score for each translation direction.
Table 6: XNLI accuracy in 15 languages.

| Model | en | fr | es | de | el | bg | ru | tr | ar | vi | th | zh | hi | sw | ur | avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| XLM-R (w/ fine-tune on en) | 84.6 | 78.2 | 79.2 | 77.0 | 75.9 | 77.5 | 75.5 | 72.9 | 72.1 | 74.8 | 71.6 | 73.7 | 69.8 | 64.7 | 65.1 | 74.2 |
| M3P (w/ fine-tune on en) | 82.3 | 76.3 | 77.0 | 74.1 | 73.2 | 76.2 | 74.1 | 70.3 | 69.2 | 73.9 | 69.6 | 72.9 | 68.6 | 59.4 | 64.7 | 72.1 |
Table 7: BLEU-4 on the XGLUE News Title Generation task.

| Model | en | de | es | fr | ru | avg |
|---|---|---|---|---|---|---|
| Unicoder (w/ fine-tune on en) | 15.6 | 9.0 | 8.7 | 6.8 | 7.7 | 9.6 |
| Unicoder (w/ fine-tune on en) | 15.8 | 11.9 | 9.9 | 7.5 | 8.4 | 10.7 |
| M3P (w/ fine-tune on en) | 14.1 | 8.0 | 7.3 | 5.2 | 6.1 | 8.1 |
The task of multilingual natural language inference is to predict the entailment relation (Entailment, Contradiction or Neutral) between two sentences in a specific language. We evaluate M3P on XNLI using its original train, dev and test splits, and compare it with the base version (12 layers) of XLM-R. We fine-tune both models on the English labeled data and then apply the fine-tuned models to all test sets in 15 languages. Evaluation results are listed in Table 6. From Table 6 we can see that, although M3P is pre-trained for different types of tasks (understanding and generation) from different perspectives (multilingual and multimodal), it still obtains surprisingly good performance on XNLI, which shows the possibility of learning universal representations.
We also evaluate M3P on the News Title Generation (NTG) task in XGLUE, and compare it with the extended version of Unicoder described in XGLUE. We fine-tune both models on the English labeled data and then apply the fine-tuned models to all test sets in 5 languages. Evaluation results are listed in Table 7. Similar to the trend on XNLI, Table 7 shows that M3P maintains good performance on this multilingual text generation task as well.
We have presented in this paper a new pre-trained model, M3P, for multilingual-multimodal representation learning. The learned representations show a strong cross-lingual transfer capability and are proven effective on five downstream tasks. To facilitate research on multilingual-multimodal modeling, we also build a large-scale dataset called MILD and will make it publicly available to the research community.