
Bootstrapping Disjoint Datasets for Multilingual Multimodal Representation Learning

Recent work has highlighted the advantage of jointly learning grounded sentence representations from multiple languages. However, the data used in these studies has been limited to an aligned scenario: the same images annotated with sentences in multiple languages. We focus on the more realistic disjoint scenario in which there is no overlap between the images in multilingual image–caption datasets. We confirm that training with aligned data results in better grounded sentence representations than training with disjoint data, as measured by image–sentence retrieval performance. In order to close this gap in performance, we propose a pseudopairing method to generate synthetically aligned English–German–image triplets from the disjoint sets. The method works by first training a model on the disjoint data, and then creating new triples across datasets using sentence similarity under the learned model. Experiments show that pseudopairs improve image–sentence retrieval performance compared to disjoint training, despite requiring no external data or models. However, we do find that using an external machine translation model to generate the synthetic data sets results in better performance.



1 Introduction

The perceptual-motor system plays an important role in concept acquisition and representation, and in learning the meaning of linguistic expressions (Pulvermüller, 2005). In natural language processing, many approaches have been proposed that integrate visual information in the learning of word and sentence representations, highlighting the benefits of visually grounded representations Lazaridou et al. (2015); Baroni (2016); Kiela et al. (2017); Elliott and Kádár (2017). In these approaches the visual world is taken as a naturally occurring meaning representation for linguistic utterances, grounding language in perceptual reality.

Recent work has shown that we can learn better visually grounded representations of sentences by training image–sentence ranking models on multiple languages Gella et al. (2017); Kádár et al. (2018). This line of research has focused on training models on datasets where the same images are annotated with sentences in multiple languages. This alignment has either been in the form of the translation pairs (e.g. German, English, French, and Czech in Multi30K Elliott et al. (2016)) or independently collected sentences (English and Japanese in STAIR Yoshikawa et al. (2017)).

In this paper, we consider the problem of training an image–sentence ranking model using image–caption collections in different languages with non-overlapping images drawn from different sources. We call these collections disjoint datasets and argue that it is easier to find disjoint datasets than aligned datasets. This is especially the case for datasets in different languages, e.g. digital museum collections such as Europeana, newspaper collections (Ramisa et al., 2017), or the images used in Wikipedia articles (Schamoni et al., 2018). Multilingual aligned datasets, by contrast, are small and expensive to collect (Elliott et al., 2016), so there is a need for methods that can train image–sentence ranking models on disjoint multilingual image datasets.

Kádár et al. (2018) claim that a multilingual image–sentence ranking model trained on disjoint datasets performs on par with a model trained on aligned data. However, the disjoint datasets in their paper are artificial because they were formed by randomly splitting the Multi30K dataset into two halves. We examine whether the ranking model can benefit from multilingual supervision when it is trained using disjoint datasets drawn from different sources. In experiments with the Multi30K and COCO datasets, we find substantial benefits from training with these disjoint sources, but the best performance comes from training on aligned datasets.

Given the empirical benefits of training on aligned datasets, we explore two approaches to creating synthetically aligned training data in the disjoint scenario. One approach to creating synthetically aligned data is to use an off-the-shelf machine translation system to generate new image-caption pairs by translating the original captions. This approach is very simple, but has the limitation that an external system needs to be trained, which requires additional data.

The second approach is to generate synthetically aligned data that are pseudopairs. We assume the existence of image–caption datasets in different languages where the images do not overlap between the datasets. Pseudopairs are created by annotating the images of one dataset with the captions from another dataset. This can be achieved by leveraging the sentence similarities predicted by an image-sentence ranking model trained on the original image–caption datasets. One advantage of this approach is that it does not require additional models or datasets because it uses the trained model to create new pairs. The resulting pseudopairs can then be used to re-train or fine-tune the original model.

In experiments on the Multi30K and COCO datasets, we find that using an external machine translation system to create the synthetic data improves image–sentence ranking performance by 26.1% compared to training on only the disjoint data. The proposed pseudopair approach consistently improves performance compared to the disjoint baseline by 6.4%, and, crucially, this improvement is achieved without using any external datasets or pre-trained models. We expect that there is a broad scope for more complex pseudopairing methods in future work in this direction.

2 Method

We adopt the model architecture and training procedure of Kádár et al. (2018) for the task of matching images with sentences. This task is defined as learning to rank the sentences associated with an image higher than other sentences in the data set, and vice-versa Hodosh et al. (2013). The model comprises a recurrent neural network language model and a convolutional neural network image encoder. The parameters of the language encoder are randomly initialized, while the image encoder is pre-trained, frozen during training and followed by a linear layer which is tuned for the task. The model is trained to make true pairs $\langle a, b \rangle$ similar to each other, and contrastive pairs $\langle \hat{a}, b \rangle$ and $\langle a, \hat{b} \rangle$ dissimilar from each other in a joint embedding space by minimizing the max-violation loss function Faghri et al. (2017):

$$J(a, b) = \max_{\hat{a}} \left[ \max\left(0,\ \alpha - s(a, b) + s(\hat{a}, b)\right) \right] + \max_{\hat{b}} \left[ \max\left(0,\ \alpha - s(a, b) + s(a, \hat{b})\right) \right] \quad (1)$$

where $s$ is the similarity function and $\alpha$ is the margin.
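As an illustration of the max-violation idea, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes a square similarity matrix whose diagonal holds the true pairs; the function name `max_violation_loss`, the toy matrix and the margin value are all hypothetical and chosen only for the example.

```python
from typing import List


def max_violation_loss(sim: List[List[float]], margin: float) -> float:
    """Sum of hardest-negative hinge losses over a batch.

    sim[i][j] is the similarity between item a_i and item b_j;
    the true pairs sit on the diagonal (i == j). Needs at least 2 items.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        # Hardest contrastive b for a_i (row i, off-diagonal).
        hardest_b = max(sim[i][j] for j in range(n) if j != i)
        # Hardest contrastive a for b_i (column i, off-diagonal).
        hardest_a = max(sim[j][i] for j in range(n) if j != i)
        total += max(0.0, margin - pos + hardest_b)
        total += max(0.0, margin - pos + hardest_a)
    return total


# Toy 3x3 similarity matrix; the margin value here is illustrative only.
toy_sim = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.7],
    [0.0, 0.4, 0.5],
]
loss = max_violation_loss(toy_sim, margin=0.2)
```

Only the single hardest contrastive item in each direction contributes to the loss, which is what distinguishes max-violation from summing the hinge terms over all negatives in the batch.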


In our experiments, the pairs are either image–caption pairs or caption–caption pairs (following Gella et al. (2017); Kádár et al. (2018)). When we train on image–caption pairs, we sample a batch from an image–caption data set with uniform probability, encode the images and the sentences, and perform an update of the model parameters. For the caption–caption objective, we follow Kádár et al. (2018) and generate a sentence pair data set by taking all pairs of sentences that belong to the same image and are written in different languages: 5 English and 5 German captions result in 25 English-German pairs. The sentences are encoded and we perform an update of the model parameters using the same loss. When training with both the image–caption and caption–caption (c2c) ranking tasks, we randomly select the task to perform with probability $p = 0.5$.
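The construction of the caption–caption data and the random task selection can be sketched as follows. This is a hedged illustration only: the dictionary layout (image id mapped to a list of captions) and the names `caption_pairs` and `pick_task` are assumptions made for the example, not part of the original codebase.

```python
import random
from itertools import product
from typing import Dict, List, Tuple


def caption_pairs(
    en: Dict[str, List[str]], de: Dict[str, List[str]]
) -> List[Tuple[str, str]]:
    """All cross-lingual caption pairs that describe the same image."""
    pairs: List[Tuple[str, str]] = []
    for image_id in sorted(en.keys() & de.keys()):
        pairs.extend(product(en[image_id], de[image_id]))
    return pairs


def pick_task(rng: random.Random, p_c2c: float = 0.5) -> str:
    """Choose which ranking task the next parameter update uses."""
    return "c2c" if rng.random() < p_c2c else "image-caption"


# One image with 5 English and 5 German placeholder captions.
en_caps = {"img1": [f"en caption {k}" for k in range(5)]}
de_caps = {"img1": [f"de caption {k}" for k in range(5)]}
pairs = caption_pairs(en_caps, de_caps)
tasks = [pick_task(random.Random(seed)) for seed in range(4)]
```

With five captions per language, `pairs` holds the 25 English-German combinations mentioned in the text.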

2.1 Generating Synthetic Pairs

We propose two approaches to creating synthetic image–caption pairs to improve image–sentence ranking models when training with disjoint data sets. We assume the existence of datasets $D_1: \langle I_1, C_1 \rangle$ and $D_2: \langle I_2, C_2 \rangle$, consisting of image–caption pairs $\langle i_1, c_1 \rangle$ and $\langle i_2, c_2 \rangle$ in languages $\ell_1$ and $\ell_2$, where the image sets do not overlap: $I_1 \cap I_2 = \emptyset$. We seek to extend $D_2$ to a bilingual dataset with synthetic captions $\hat{c}_1$ in language $\ell_1$, resulting in a triplet data set $\langle I_2, C_2, \hat{C}_1 \rangle$ consisting of triplets $\langle i_2, c_2, \hat{c}_1 \rangle$. We hypothesize that the new dataset will improve model performance because the model will be trained to map the images to captions in both languages.

2.2 Pseudopairs approach

Given two image–caption corpora $D_1$ and $D_2$, with pairs $\langle i_1, c_1 \rangle$ and $\langle i_2, c_2 \rangle$, we generate a pseudopair corpus by labeling each image in $D_2$ with a caption from $D_1$. We create pseudopairs only in one direction, leading to new image–caption pairs $\langle i_2, \hat{c}_1 \rangle$.

The pseudopairs are generated using the sentence representations of the model trained on both corpora $D_1$ and $D_2$ jointly. We encode all captions $c_1 \in C_1$ and $c_2 \in C_2$, and for each $c_2$ find the most similar caption $\hat{c}_1 \in C_1$ using the cosine similarity between the sentence representations. This leads to pairs $\langle i_2, \hat{c}_1 \rangle$ and, as a result, to triplets $\langle i_2, c_2, \hat{c}_1 \rangle$.
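The nearest-caption search behind the pseudopairs can be sketched in a few lines. This is a minimal illustration under stated assumptions: sentence representations are given as plain lists of floats with non-zero norm, and the names `cosine`, `make_pseudopairs` and the toy vectors are hypothetical. A real implementation would use batched matrix operations over the encoder outputs.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm


def make_pseudopairs(
    target_caps: List[Sequence[float]], source_caps: List[Sequence[float]]
) -> List[Tuple[int, int, float]]:
    """For each target caption: (target index, best source index, similarity)."""
    result = []
    for t_idx, t_vec in enumerate(target_caps):
        sims = [cosine(t_vec, s_vec) for s_vec in source_caps]
        best = max(range(len(sims)), key=sims.__getitem__)
        result.append((t_idx, best, sims[best]))
    return result


# Toy 2-d embeddings standing in for encoder outputs.
coco_caption_vecs = [[1.0, 0.1], [0.0, 1.0], [0.7, 0.7]]
m30k_caption_vecs = [[0.9, 0.0], [0.1, 0.9]]
pseudo = make_pseudopairs(coco_caption_vecs, m30k_caption_vecs)
```

The image that belongs to each target caption is then annotated with the selected source caption. Note that in the toy data the second source caption is selected twice: nothing in a plain arg-max prevents the same caption from being reused for many images.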


Optionally, we filter the resulting pseudopair set in an attempt to avoid misleading samples. We consider three filtering strategies:

  1. No filtering.

  2. Keep top: keep only the items with similarity scores above the 75th percentile, i.e. keep the top 25%.

  3. Remove bottom: keep the items with similarity scores above the 25th percentile, i.e. remove the bottom 25%.
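The three filtering strategies can be sketched as a percentile threshold on the similarity scores. This is an illustrative sketch only: the tuple layout, the strategy names, the linear-interpolation percentile and the inclusive comparison at the threshold are assumptions, since the text does not specify how boundary cases are handled.

```python
from typing import List, Tuple


def percentile(values: List[float], q: float) -> float:
    """q-th percentile (0-100) with linear interpolation between ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def filter_pseudopairs(
    pairs: List[Tuple[str, str, float]], strategy: str
) -> List[Tuple[str, str, float]]:
    """pairs holds (image id, caption, similarity) tuples."""
    if strategy == "none":
        return list(pairs)
    # Hypothetical strategy names mapped to the percentile used as threshold.
    q = {"keep_top": 75.0, "remove_bottom": 25.0}[strategy]
    threshold = percentile([s for _, _, s in pairs], q)
    return [p for p in pairs if p[2] >= threshold]


scored = [(f"img{k}", f"cap{k}", s) for k, s in enumerate(
    [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90])]
kept_top = filter_pseudopairs(scored, "keep_top")
kept_most = filter_pseudopairs(scored, "remove_bottom")
```

On the nine toy scores, "keep top" retains the three highest-scoring pairs and "remove bottom" retains seven.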

Fine-tuning vs. restart

After the pseudopairs are generated we consider two options: re-training the model from scratch on all previous data sets plus the generated pseudopairs, or fine-tuning the original model on the same data sets plus the additional pseudopairs.

2.3 Translation approach

Given a corpus $D_2$ with pairs $\langle i_2, c_2 \rangle$, we use a machine translation system to translate each caption $c_2$ to a language $\ell_1$, leading to new image–caption pairs $\langle i_2, \hat{c}_1 \rangle$. (Li et al. (2016) used a similar approach to create Chinese captions for images in the Flickr8K dataset, but they used the translations to train a Chinese image captioning model.) Any off-the-shelf translation system could be used to create the translated captions, e.g. an online service or a pre-trained translation model.

3 Experimental Protocol

3.1 Model

Our implementation, training protocol and parameter settings are based on the existing codebase of Kádár et al. (2018). In all experiments, we use the 2048-dimensional image features extracted from the last average-pooling layer of a pre-trained ResNet50 CNN He et al. (2016); the network was trained on the ILSVRC 2012 1.2M-image, 1000-class object classification subset of ImageNet Russakovsky et al. (2015). The image representation used in our model is obtained by a single affine transformation that we train from scratch. For the sentence encoder we use a uni-directional Gated Recurrent Unit (GRU) network Cho et al. (2014) with a single hidden layer with 1024 hidden units and 300-dimensional word embeddings. When training bilingual models we use a single word embedding for the same word-forms, making no distinction if they come from different languages. Each sentence is represented by the final hidden state of the GRU. For the similarity function in the loss function (Eq. 1) we use cosine similarity and margin parameter.

In all experiments, we early stop on the validation set when no improvement is observed for 10 inspections, which are performed every 500 updates. The stopping criterion is the sum of text-to-image (T→I) and image-to-text (I→T) recall scores at ranks 1, 5 and 10 across all languages in the training data. The models are trained with a batch size of 128 with the Adam optimizer Kingma and Ba (2014) using default parameters and an initial learning rate of 2e-4, without applying any learning-rate decay schedule. We apply gradient norm clipping with a value of 2.0.
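The patience-based stopping rule can be sketched as below. This is a simplified illustration, not the training loop of the original codebase: the function name `early_stop` and the validation scores are made up, and each element of the input stands for one inspection (one every 500 updates in the setup described above).

```python
from typing import Iterable, Tuple


def early_stop(scores: Iterable[float], patience: int = 10) -> Tuple[int, float]:
    """Return (index of best inspection, best score) under patience-based stopping.

    scores yields the validation criterion at each inspection.
    """
    best_idx, best_score, waited = -1, float("-inf"), 0
    for idx, score in enumerate(scores):
        if score > best_score:
            best_idx, best_score, waited = idx, score, 0
        else:
            waited += 1
            if waited >= patience:
                break  # no improvement for `patience` inspections in a row
    return best_idx, best_score


# Made-up validation sums: improvement stops after the third inspection.
history = [300.0, 310.0, 315.0] + [314.0] * 12 + [400.0]
best_idx, best_score = early_stop(history, patience=10)
```

In the toy history, training stops ten inspections after the best score, so the late value of 400.0 is never reached.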

We use a pre-trained OpenNMT Klein et al. (2018) English-German machine translation model to create the data for the translation approach described in Section 2.3.

3.2 Datasets

The models are trained and evaluated on the bilingual English-German Multi30K dataset (M30K), and we optionally train on the English COCO dataset Chen et al. (2015). In monolingual experiments, the model is trained on a single language from M30K or COCO.

In the aligned bilingual experiments, we use the independently collected English and German captions in M30K: The training set consists of 29K images and 145K captions; the validation and test sets have 1K images and 5K captions.

For the disjoint experiments, we use the COCO data set with the Karpathy and Fei-Fei (2015) splits. This gives 82,783 training, 5,000 validation, and 5,000 test images; each image is paired with five captions. The data set has an additional split containing the 30,504 images from the original validation set of MS-COCO (“restval”), which we add to the training set as in previous work Karpathy and Fei-Fei (2015); Vendrov et al. (2016); Faghri et al. (2017).

3.3 Evaluation

We report results on the Multimodal Translation Shared Task 2016 test split Specia et al. (2016) of M30K. Due to space constraints, we only report recall at 1 (R@1) for Image-to-Text (I→T) and Text-to-Image (T→I) retrieval, and the sum of R@1, R@5, and R@10 recall scores across both tasks and languages (Sum). The Sum is also the criterion we use for early stopping.
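For readers unfamiliar with the metric, recall at K and the sum of recalls for one retrieval direction can be sketched as follows. The function names and the toy rankings are illustrative assumptions; the reported Sum additionally adds up both retrieval directions (and languages), which this sketch does not do.

```python
from typing import List, Sequence, Set


def recall_at_k(
    rankings: List[Sequence[int]], relevant: List[Set[int]], k: int
) -> float:
    """Percentage of queries with at least one relevant item in the top k."""
    hits = sum(
        1 for ranked, gold in zip(rankings, relevant) if gold & set(ranked[:k])
    )
    return 100.0 * hits / len(rankings)


def sum_of_recalls(
    rankings: List[Sequence[int]], relevant: List[Set[int]], ks=(1, 5, 10)
) -> float:
    """R@1 + R@5 + R@10 for a single retrieval direction."""
    return sum(recall_at_k(rankings, relevant, k) for k in ks)


# Four toy queries, each with a ranked list of candidate ids.
toy_rankings = [[3, 1, 2, 0], [1, 0, 2, 3], [0, 3, 2, 1], [2, 3, 1, 0]]
toy_relevant = [{3}, {0}, {1}, {3}]
r1 = recall_at_k(toy_rankings, toy_relevant, 1)
r2 = recall_at_k(toy_rankings, toy_relevant, 2)
```

For image-to-text retrieval each image has several correct captions, so `relevant` holds a set per query; for text-to-image retrieval each set contains a single image.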

4 Baseline Results

The experiments presented here set the baseline performance for the visually grounded bilingual models and introduce the data settings that we will use in the later sections.


Model          I→T    T→I    Sum
En             40.5   28.8   346.4
  + De         41.4   29.9   352.8
    + c2c      42.8   32.1   361.6
COCO           34.4   24.8   304.0
  + En         46.2   33.4   374.4

Table 1: Performance on the English M30K 2016 test set in the aligned setting for models trained on M30K English (En), both M30K German and English (+De), with caption ranking (+c2c), COCO (COCO) and both COCO and M30K English (+En).

Model          I→T    T→I    Sum
De             34.9   24.6   311.2
  + En         38.6   26.0   324.6
    + c2c      38.3   27.7   334.0
  + COCO       36.4   25.7   319.7

Table 2: Performance on the German M30K 2016 test set in the aligned and disjoint settings for models trained on M30K German (De), both M30K German and English (+En), with caption ranking (+c2c), and both M30K German and COCO (+COCO).

               English                 German
Model          I→T    T→I    Sum       I→T    T→I    Sum
En+De+c2c      42.8   28.6   361.6     38.3   27.7   334.0
  + COCO       46.5   34.8   378.9     40.6   28.8   344.6

Table 3: Recall @ 1 and Sum-of-Recall-Scores for Image-to-Text (I→T) and Text-to-Image (T→I) baseline results on the English and German M30K 2016 test set in the aligned plus disjoint setting.

In these experiments we only use the aligned English-German data from M30K. Tables 1 and 2 present the results for English and German, respectively. The Sum-of-recall scores for both languages show that the best approach is the bilingual model with the c2c loss (En+De+c2c, and De+En+c2c). These results reproduce the findings of Kádár et al. (2018).


We now determine the performance of the model when it is trained on data drawn from different data sets with no overlapping images.

First we train two English monolingual models: one on the M30K English dataset and one on the English COCO dataset. Both models are evaluated on image–sentence ranking performance on the M30K English test 2016 set. The results in Table 1 show that there is a substantial difference in performance in both text-to-image and image-to-text retrieval, depending on whether the model is trained on the M30K or the COCO dataset. The final row of Table 1 shows, however, that jointly training on both data sets improves over only using the M30K English training data.

We also conduct experiments in the bilingual disjoint setting, where we study whether it is possible to improve the performance of a German model using the out-of-domain English COCO data. Table 2 shows that there is an increase in performance when the model is trained on the disjoint sets, as opposed to only the in-domain M30K German (compare De against De+COCO). This result is not too surprising as we have observed both the advantage of joint training on both languages in the aligned setting and the overlap between the different datasets.

Finally, we compare the performance of a German model trained in the aligned and disjoint settings. We find that a model trained in the aligned setting (De+En) is better than a model trained in the disjoint setting (De+COCO), as shown in Table 2. This finding contradicts the conclusion of Kádár et al. (2018), who claimed that the aligned and disjoint conditions lead to comparable performance. This is most likely because the disjoint setting in Kádár et al. (2018) is artificial, in the sense that they used different 50% subsets of M30K. In our experiments the disjoint image–caption sets are real, in the sense that we trained the models on two different datasets.

Aligned plus disjoint

Our final baseline experiments explore the combination of disjoint and aligned data settings. We train an English-German bilingual model with the c2c objective on M30K, and we also train on the English COCO data. Table 3 shows that adding the disjoint data improves performance for both English and German compared to training solely on the aligned data.


First, we reproduced the findings of Kádár et al. (2018), showing that bilingual joint training improves over monolingual training and that using the c2c loss further improves performance. Furthermore, we found that adding COCO as additional training data improves performance, both when training only on German and when training on both German and English from M30K, even though COCO is drawn from a different dataset.

Model              I→T    T→I    Sum
De + COCO          36.4   25.7   319.7
  + pseudo         37.3   25.2   319.9
    + fine-tune    38.0   25.6   322.9
  + pseudo 25%     37.3   25.9   320.9
    + fine-tune    37.2   25.7   320.7
  + pseudo 75%     36.8   25.1   316.3
    + fine-tune    36.5   25.5   317.5

Table 4: A disjoint model is trained on the De+COCO datasets and used to generate pseudopairs. Then the full pseudopair set (+pseudo) or the filtered versions (+pseudo 25% and +pseudo 75%) are used as an extra data set to either re-train the model from scratch or fine-tune the original De+COCO model (+fine-tune).

                   English                 German
Model              I→T    T→I    Sum       I→T    T→I    Sum      Sum(Sum)
En+De+COCO+c2c     46.5   34.8   378.9     40.6   28.8   344.6    723.5
  + pseudo         48.1   35.6   382.3     41.8   29.0   345.6    727.8
    + fine-tune    47.0   35.7   381.5     40.9   28.7   346.8    728.2
  + pseudo 25%     47.5   34.9   380.2     41.5   28.9   345.5    725.7
    + fine-tune    46.1   35.4   379.7     41.6   29.1   347.8    727.5
  + pseudo 75%     45.9   34.0   373.6     40.3   27.9   339.1    712.7
    + fine-tune    46.2   35.1   378.6     41.0   29.1   345.1    723.6
  + Translation    47.5   36.2   384.5     43.5   30.5   357.9    742.4

Table 5: We train the aligned plus disjoint model with the c2c loss and add the full pseudopair set (+pseudo) or the filtered versions (+pseudo 25% and +pseudo 75%) as an extra data set. The model is either re-trained from scratch or fine-tuned (+fine-tune). We also report the result of training the aligned plus disjoint model with the synthetic translations (+Translation).

5 Training with Pseudopairs

In this section we turn our attention to creating a synthetic English-German aligned data set from the English COCO using the pseudopair method (Section 2.2). The synthetic data set is used to train an image–sentence ranking model either from scratch or by fine-tuning the original model; in addition, we explore the effect of using all of the pseudopairs or a filtered subset. We hypothesise that training a model with the additional pseudopairs will improve over the aligned plus disjoint baseline.


We generate pseudopairs using the disjoint bilingual model trained on the German M30K and the English COCO. Table 4 reports the results when evaluating on the M30K German data. Line 2 shows that using the full pseudopair set and re-training the model does not lead to noticeable improvements. However, line 3 shows that performance increases when we train with all pseudopairs and fine-tune the original disjoint bilingual model. Filtering the pseudopairs at either the 25th or the 75th percentile is detrimental to the final performance. We did not find any improvements in the disjoint setting when training with pseudopairs and the additional c2c loss.

Aligned plus disjoint

We generate pseudopairs using a model trained on M30K English-German data with the c2c objective and the English COCO data set. The results for both English and German are reported in Table 5; note that when we train with the pseudopairs we also train with the c2c loss on both data sets. Overall we find that pseudopairs improve performance; however, we do not achieve the best results for English and German in the same conditions. The best results for German come from filtering at the 25th percentile and applying fine-tuning, while for English the best results are obtained without filtering or fine-tuning. The best overall model is trained with all the pseudopairs with fine-tuning, according to the Sum of the Sum-of-recall scores across both English and German. The performance across both data sets is increased from 723.5 to 728.2 using the pseudopair method.


In both aligned plus disjoint and disjoint scenarios, the additional pseudopairs improve performance, and in both cases the overall best performance is achieved when applying the fine-tuning strategy and no filtering of the samples.

Model              I→T    T→I    Sum
De + COCO          36.4   25.7   319.7
  + Translation    37.7   26.3   327.2
    + c2c          39.9   26.7   335.5

Table 6: Results on the German M30K 2016 test set with the disjoint (De+COCO) model, with the additional automatically translated COCO captions (+Translation), and with the c2c objective on the synthetic pairs (+c2c).

6 Training with Translations

We now focus on our second approach to creating an English-German aligned dataset using the translation method described in Section 2.3.


We first report the results of the disjoint bilingual model trained on the German M30K, the English COCO data, and the translated German COCO in Table 6. The results show that retrieval performance is improved when the model is trained on the translated German COCO data in addition to the English COCO data. We find the best performance when we jointly train on the M30K German, the translated German COCO and the English COCO with the additional c2c objective over the COCO datasets (+c2c). We note that this setup leads to a better model, as measured by the sum-of-recall-scores, than training on the aligned M30K data (compare De+COCO+Translation+c2c in Table 6 to De+En+c2c in Table 2).

Aligned plus Disjoint

In these experiments, we train models with the aligned M30K data, the disjoint English COCO data, and the translated German COCO data. Table 5 presents the results for the English and German evaluation. We find that training on the German Translated COCO data and using the c2c loss over the COCO data results in improvements for both languages.


In both the disjoint and aligned plus disjoint settings, we find that training with the translations of COCO improves performance over training with only the English COCO data.

7 Discussion

7.1 Sentence-similarity quality

The core of the proposed pseudopairing method is based on measuring the similarity between sentences, but how well does our model encode similar sentences? Here we analyze the ability of our models to identify translation equivalent sentences using the English-German translation pairs in the Multi30K test 2016 data. This experiment proceeds as follows: (i) we assume a pre-trained image–sentence ranking model, (ii) we encode the German and English sentences using the language encoder of the model, (iii) we calculate the model’s performance on the task of ranking the correct translation for English sentences, given the German caption, and vice-versa.

To put our results into perspective we compare to the best approach to our knowledge, as reported by Rotman et al. (2018): DPCCA is a deep partial canonical correlation analysis method that maximizes the canonical correlation between captions of the same image conditioned on image representations as a third view. Table 7 reports the results of this experiment. Our models consistently improve upon the state-of-the-art. The baseline aligned model trained on the Multi30K data slightly outperforms DPCCA for EN→DE retrieval, and more substantially outperforms DPCCA for DE→EN. If we train the same model with the additional c2c objective, R@1 improves over DPCCA by 8.0 and 12.1 points, respectively. We find that adding more monolingual English data from the external COCO data set slightly degrades retrieval performance, and that performing sentence retrieval using a model trained on the disjoint M30K German and English COCO data sets results in much lower retrieval performance. We conclude that the model that we used to estimate sentence similarity is the best-performing method known for this task on this data set, but there is room for improvement for models trained on disjoint data sets.

Model                    EN→DE   DE→EN
DPCCA                    82.6    79.1
En + De                  82.7    83.4
En + De + c2c            90.6    91.2
En + De + COCO           82.5    81.0
En + De + COCO + c2c     90.0    90.1
De + COCO                73.4    70.7

Table 7: Translation retrieval results (Recall @ 1) on the M30K 2016 test set compared to the state of the art.

7.2 Characteristics of the Pseudopairs

We now investigate the properties of the pseudopairs generated by our method. In particular, we focus on pseudopairs generated by an aligned plus disjoint model (En+De+COCO+c2c) and a disjoint model (De+COCO).

The pseudopairs generated by the aligned plus disjoint model cover 40% of the German captions in the M30K data set, and overall, the pseudopairs form a heavy-tailed distribution. We find a similar pattern for the pseudopairs generated by the disjoint model: the pseudopairs cover 37% of the M30K data set, and the top 150 captions cover 23% of the data. This is far from using each caption equally in the pseudopair transfer, and may suggest a hubness problem Dinu et al. (2014). We assessed the stability of the sets of transferred captions using the Jaccard measure in two cases: (i) different random seeds, and (ii) disjoint or aligned plus disjoint. For the aligned plus disjoint model, we observe an overlap of 0.53 between different random seeds compared to 0.51 for the disjoint model. The overlap between the two types of models is much lower at 0.41. Finally, we find that when a caption is transferred by both models, the overlap of the caption annotating the same COCO image is 0.33 for the disjoint model, and 0.34 for the aligned plus disjoint model, and the overlap between the models is 0.16. This shows that the models do not transfer the same captions for the same images.
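The coverage and overlap statistics discussed above can be computed along the following lines. This is a hedged sketch with hypothetical names and toy data; in particular, reading "the top 150 captions cover 23% of the data" as the share of pseudopairs accounted for by the most reused captions is an interpretation, not something the text states explicitly.

```python
from collections import Counter
from typing import Hashable, Iterable, List, Set


def jaccard(a: Set[Hashable], b: Set[Hashable]) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def coverage(transferred: Iterable[Hashable], pool_size: int) -> float:
    """Fraction of the source caption pool that is used at least once."""
    return len(set(transferred)) / pool_size


def top_n_share(transferred: List[Hashable], n: int) -> float:
    """Fraction of pseudopairs accounted for by the n most reused captions."""
    counts = Counter(transferred)
    return sum(c for _, c in counts.most_common(n)) / len(transferred)


# Caption ids transferred by two hypothetical runs (one id per target image).
run_a = ["c1", "c1", "c1", "c2", "c3", "c1"]
run_b = ["c1", "c2", "c2", "c4", "c1", "c1"]
overlap = jaccard(set(run_a), set(run_b))
```

A low `coverage` combined with a high `top_n_share` is the signature of the heavy-tailed reuse described in the text.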

Figure 2 presents examples of the annotations transferred using the pseudopair method. The first example demonstrates the difference between the Multi30K and COCO datasets: there are no giraffes in the former, but there are dogs (“Hund”). In the second example, both captions imply that the man sits on the tree not beside it. This shows that even if the datasets are similar, transferring a caption that exactly matches the picture is difficult. The final two examples show semantically accurate and similar sentences are transferred by both models. In the fourth example, both models transfer exactly the same caption.

(a) Ein hund steht auf einem baumstamm im wald.
A dog is standing on a tree trunk in the forest.
Hund im wald.
Dog in the forest.
(b) Mann sitzt im baum.
Man is sitting in the tree.
Der mann der auf einem baum sitzt.
The man sits on the tree.
(c) Ein mann sitzt in einem boot auf einem see.
A man is sitting in a boat on a lake.
Ein mann sitzt am see auf dem ein boot fährt.
A man is sitting at the lake on which a boat is riding.
(d) Ein jet jagt steil in die luft, viel rauch kommt aus dem rumpf.
A jet goes steep up into the air, a lot of smoke is coming out of its hull.
Ein jet jagt steil in die luft, viel rauch kommt aus dem rumpf.
A jet goes steep up into the air, a lot of smoke is coming out of its hull.
Figure 2: Visualisation of the sentences transferred from Multi30K to the COCO data set using the pseudopair method. For each image, one caption is transferred from a model trained on De+COCO and the other from a model trained on En+De+COCO. English glosses of the sentences are included for ease of reading.



8 Related Work

Image–sentence ranking is the task of retrieving the sentences that best describe an image, and vice-versa Hodosh et al. (2013). Most recent approaches are based on learning to project image representations and sentence representations into a shared space using deep neural networks (Frome et al., 2013; Socher et al., 2014; Vendrov et al., 2016; Faghri et al., 2017, inter-alia).

More recently, there has been a focus on solving this task using multilingual data Gella et al. (2017); Kádár et al. (2018) in the Multi30K dataset Elliott et al. (2016), an extension of the popular Flickr30K dataset into German, French, and Czech. These works take a multi-view learning perspective in which images and their descriptions in multiple languages are different views of the same concepts. The assumption is that common representations of multiple languages and perceptual stimuli can potentially exploit complementary information between views to learn better representations. For example, Rotman et al. (2018) improve bilingual sentence representations by incorporating image information as a third view through Deep Partial Canonical Correlation Analysis. More similar to our work, Gella et al. (2017) propose a convolutional-recurrent architecture with both an image–caption and a caption–caption loss to learn bilingual visually grounded representations. Their results were improved by the approach presented in Kádár et al. (2018), who also showed that multilingual models outperform bilingual models, and that image–caption retrieval performance in languages with fewer resources can be improved with data from higher-resource languages. We largely follow Kádár et al. (2018); however, our main interest lies in learning multimodal and bilingual representations in the scenario where the images do not come from the same data set, i.e. the data is presented as two sets of image–caption tuples rather than image–caption–caption triples.

Taking a broader perspective, images have been used as pivots in multilingual multimodal language processing. On the word level this intuition is applied to visually grounded bilingual lexicon induction, which aims to learn cross-lingual word representations without aligned text using images as pivots Bergsma and Van Durme (2011); Kiela et al. (2015); Vulić et al. (2016); Hartmann and Søgaard (2017); Hewitt et al. (2018). Images have been used as pivots to learn translation models only from image–caption data sets, without parallel text Hitschler et al. (2016); Nakayama and Nishida (2017); Lee et al. (2017); Chen et al. (2018).

9 Conclusions

Previous work has demonstrated improved image–sentence ranking performance when training models jointly on multiple languages Gella et al. (2017); Kádár et al. (2018). Here we presented a study on learning multimodal and multilingual representations in the disjoint setting, where images between languages do not overlap. We found that learning visually grounded sentence embeddings in this setting is more challenging. To close the gap, we developed a pseudopairing technique that creates synthetic pairs by annotating the images from one of the data sets with the image descriptions of the other, using the sentence similarities of the model trained on both. We showed that training with pseudopairs improves performance, without the need to augment training from additional data sources or other pipeline components. However, our technique is outperformed by creating synthetic pairs using an off-the-shelf automatic machine translation system. Our results therefore suggest that it is better to use translation when a good translation system is available; in its absence, pseudopairs offer consistent improvements. We have found that our pseudopairing method only transfers annotations from a small number of images, and in the future we plan to substitute our naive matching algorithms with approaches developed to mitigate this hubness issue Radovanović et al. (2010) and to close the gap between translation and pseudopairs.


References

  • M. Baroni (2016) Grounding distributional semantics in the visual world. Language and Linguistics Compass 10 (1), pp. 3–13. Cited by: §1.
  • S. Bergsma and B. Van Durme (2011) Learning bilingual lexicons using the visual similarity of labeled web images. In IJCAI Proceedings-International Joint Conference on Artificial Intelligence, Vol. 22, pp. 1764. Cited by: §8.
  • X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick (2015) Microsoft COCO captions: data collection and evaluation server. arXiv preprint arXiv:1504.00325. Cited by: §3.2.
  • Y. Chen, Y. Liu, and V. O. Li (2018) Zero-resource neural machine translation with multi-agent communication game. arXiv preprint arXiv:1802.03116. Cited by: §8.
  • K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio (2014) On the properties of neural machine translation: encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pp. 103–111. Cited by: §3.1.
  • G. Dinu, A. Lazaridou, and M. Baroni (2014) Improving zero-shot learning by mitigating the hubness problem. arXiv preprint arXiv:1412.6568. Cited by: §7.2.
  • D. Elliott, S. Frank, K. Sima’an, and L. Specia (2016) Multi30K: multilingual English-German image descriptions. In Proceedings of the 5th Workshop on Vision and Language, pp. 70–74. Cited by: §1, §1, §8.
  • D. Elliott and Á. Kádár (2017) Imagination improves multimodal translation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Vol. 1, pp. 130–141. Cited by: §1.
  • F. Faghri, D. J. Fleet, J. R. Kiros, and S. Fidler (2017) VSE++: improved visual-semantic embeddings. arXiv preprint arXiv:1707.05612. Cited by: §2, §3.2, §8.
  • A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al. (2013) DeViSE: a deep visual-semantic embedding model. In Advances in neural information processing systems, pp. 2121–2129. Cited by: §8.
  • S. Gella, R. Sennrich, F. Keller, and M. Lapata (2017) Image pivoting for learning multilingual multimodal representations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2839–2845. Cited by: §1, §8, §9.
  • M. Hartmann and A. Søgaard (2017) Limitations of cross-lingual learning from image search. arXiv preprint arXiv:1709.05914. Cited by: §8.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.1.
  • J. Hewitt, D. Ippolito, B. Callahan, R. Kriz, D. T. Wijaya, and C. Callison-Burch (2018) Learning translations via images with a massively multilingual image dataset. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 2566–2576. Cited by: §8.
  • J. Hitschler, S. Schamoni, and S. Riezler (2016) Multimodal pivots for image caption translation. arXiv preprint arXiv:1601.03916. Cited by: §8.
  • M. Hodosh, P. Young, and J. Hockenmaier (2013) Framing image description as a ranking task: data, models and evaluation metrics. Journal of Artificial Intelligence Research 47, pp. 853–899. Cited by: §2, §8.
  • Á. Kádár, D. Elliott, M-A. Côté, G. Chrupała, and A. Alishahi (2018) Lessons learned in multilingual grounded language learning. In Proceedings of the 21st Conference on Computational Natural Language Learning. Cited by: §1, §2, §2, §4, §4, §8, §9.
  • A. Karpathy and L. Fei-Fei (2015) Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3128–3137. Cited by: §3.2.
  • D. Kiela, A. Conneau, A. Jabri, and M. Nickel (2017) Learning visually grounded sentence representations. arXiv preprint arXiv:1707.06320. Cited by: §1.
  • D. Kiela, I. Vulić, and S. Clark (2015) Visual bilingual lexicon induction with transferred convnet features. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015). Cited by: §8.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.1.
  • G. Klein, Y. Kim, Y. Deng, V. Nguyen, J. Senellart, and A. Rush (2018) OpenNMT: neural machine translation toolkit. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Papers), Boston, MA, pp. 177–184. Cited by: §3.1.
  • A. Lazaridou, N. T. Pham, and M. Baroni (2015) Combining language and vision with a multimodal skip-gram model. arXiv preprint arXiv:1501.02598. Cited by: §1.
  • J. Lee, K. Cho, J. Weston, and D. Kiela (2017) Emergent translation in multi-agent communication. arXiv preprint arXiv:1710.06922. Cited by: §8.
  • X. Li, W. Lan, J. Dong, and H. Liu (2016) Adding Chinese captions to images. In Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, pp. 271–275. Cited by: footnote 2.
  • H. Nakayama and N. Nishida (2017) Zero-resource machine translation by multimodal encoder–decoder network with multimedia pivot. Machine Translation 31 (1-2), pp. 49–64. Cited by: §8.
  • F. Pulvermüller (2005) Brain mechanisms linking language and action. Nature reviews neuroscience 6 (7), pp. 576. Cited by: §1.
  • M. Radovanović, A. Nanopoulos, and M. Ivanović (2010) On the existence of obstinate results in vector space models. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, pp. 186–193. Cited by: §9.
  • A. Ramisa, F. Yan, F. Moreno-Noguer, and K. Mikolajczyk (2017) BreakingNews: article annotation by image and text processing. IEEE transactions on pattern analysis and machine intelligence 40 (5), pp. 1072–1085. Cited by: §1.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) ImageNet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252. Cited by: footnote 4.
  • S. Schamoni, J. Hitschler, and S. Riezler (2018) A dataset and reranking method for multimodal MT of user-generated image captions. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas, pp. 140–153. Cited by: §1.
  • R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng (2014) Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics 2, pp. 207–218. Cited by: §8.
  • L. Specia, S. Frank, K. Sima’an, and D. Elliott (2016) A shared task on multimodal machine translation and crosslingual image description. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, Vol. 2, pp. 543–553. Cited by: §3.3.
  • I. Vendrov, R. Kiros, S. Fidler, and R. Urtasun (2016) Order-embeddings of images and language. ICLR. Cited by: §3.2, §8.
  • I. Vulić, D. Kiela, S. Clark, and M. Moens (2016) Multi-modal representations for improved bilingual lexicon learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 188–194. Cited by: §8.
  • Y. Yoshikawa, Y. Shigeto, and A. Takeuchi (2017) STAIR captions: constructing a large-scale Japanese image caption dataset. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vancouver, Canada, pp. 417–421. Cited by: §1.