Probing Representations Learned by Multimodal Recurrent and Transformer Models

Recent literature shows that large-scale language modeling provides excellent reusable sentence representations with both recurrent and self-attentive architectures. However, there has been less clarity on the commonalities and differences in the representational properties induced by the two architectures. It has also been shown that visual information serves as one of the means for grounding sentence representations. In this paper, we present a meta-study assessing the representational quality of models where the training signal is obtained from different modalities, in particular, language modeling, image features prediction, and both textual and multimodal machine translation. We evaluate textual and visual features of sentence representations obtained using predominant approaches on image retrieval and semantic textual similarity. Our experiments reveal that on moderate-sized datasets, a sentence counterpart in a target language or visual modality provides a much stronger training signal for sentence representation than language modeling. Importantly, we observe that while the Transformer models achieve superior machine translation quality, representations from the recurrent neural network based models perform significantly better on tasks focused on semantic relevance.





1 Introduction

Conditioning on multimodal information is one of the predominant methods of grounding representations learned by deep learning models (Chrupała et al., 2015; Lazaridou et al., 2015), i.e., relating word or sentence representations to non-linguistic real-world entities such as objects in photographs. In the context of multimodal machine translation (MT), models using a multimodal auxiliary loss have been shown to outperform their text-only counterparts (Elliott and Kádár, 2017; Helcl et al., 2018). Experiments with multimodal language models (LMs) also confirm that multimodality influences the semantic properties of the learned representations (Poerner et al., 2018).

On the other hand, recent experiments with large-scale language modeling suggest that these models provide sufficiently informative representations reusable in most natural language processing (NLP) tasks (Peters et al., 2018; Devlin et al., 2018). Current research also shows an increasing interest in the universality of learned representations, which are supposed to contain sufficient inductive biases for a variety of NLP tasks (Conneau et al., 2017; Howard and Ruder, 2018).

Research in evaluating representations has focused on measuring the correlation between the similarity of learned representations and the semantic similarity of words (Hill et al., 2015; Gerz et al., 2016) and sentences (Agirre et al., 2012, 2016). Work on probing representations includes relating learned representations to existing well-trained models by finding a mutual projection between the two and evaluating the performance of the projected representations within the trained model (Saphra and Lopez, 2018), and observing the effect of changes in the representation by backpropagating the changes to the input (Poerner et al., 2018).

Universal sentence representations are typically evaluated by their effects on downstream tasks. Conneau and Kiela (2018) and Wang et al. (2018) recently introduced comprehensive sets of such downstream tasks, providing a benchmark for sentence representation evaluation. The tasks include various sentence classification tasks, entailment, and coreference resolution. However, the drawback of these methods is that they require generating representations of millions of sentences, which are then used for rather time-consuming training of models for the downstream tasks.

In this paper, we investigate representations obtained specifically from grounded models using the two predominant sequence modeling architectures: a model based on recurrent neural networks (RNN; Mikolov et al., 2010; Bahdanau et al., 2014) and a model based on the self-attentive Transformer architecture (Vaswani et al., 2017). We study the learned representations with respect to grounding, semantics, and the degree to which some of these representations are correlated, irrespective of modeling choices. Our main observations are: a) models with access to explicit grounded information learn to ignore image information; b) grounding provides a stronger training signal and thus better semantic representations, which is especially pronounced when a model has access to fewer training samples; c) while Transformer based models may achieve better task performance, RNN based models capture better semantic information.

2 Assessing Contextual Representations

In this section, we briefly describe the methods used for extracting representations and for quantifying their quality: Canonical Correlation Analysis (CCA) for image retrieval evaluation, cosine distance for Semantic Textual Similarity (STS) evaluation, and Distance Correlation (DC) for representation similarity evaluation. Whereas the first two are used for evaluation on downstream tasks, the latter only quantifies mutual similarities of the representations.

Canonical Correlation Analysis.

We take as input two sets of aligned representations from two different subspaces, say $X = \{x_1, \ldots, x_n\}$ and $Y = \{y_1, \ldots, y_n\}$, where $x_i \in \mathbb{R}^{d_x}$ and $y_i \in \mathbb{R}^{d_y}$ are vector representations. CCA (Hotelling, 1936) finds pairs of directions $(w_x, w_y)$ such that the linear projections of $X$ and $Y$ onto these directions, i.e., the canonical representations $w_x^\top X$ and $w_y^\top Y$, are maximally correlated. For further details on CCA, we refer the reader to Hardoon et al. (2004).

The most significant property of CCA for our analysis is that it is a subspace-only method: we obtain naturally occurring correlations between two spaces. Importantly, we do not learn to align the spaces, but recover alignments that are potentially present between the two subspaces. Further, CCA is affine-invariant due to its reliance on correlation rather than on the orthogonality of direction vectors.

We apply CCA to mean-pooled sentence representations and image representations and obtain two highly correlated projections. CCA and its variants have been used in previous research to obtain cross-modal representations (Gong et al., 2014; Yan and Mikolajczyk, 2015). We evaluate the projected representations on the image retrieval task and report the recall at 10. Note that we do not backpropagate the correlation to the network and keep the representations fixed, because our goal is not to train an optimal cross-modal representation but only to assess the (already trained) sentence representations.

Cosine Distance.

For evaluation on the STS task, we use the cosine distance between vectors $u$ and $v$:

$$d_{\cos}(u, v) = 1 - \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}.$$

Following the SentEval benchmark (Conneau and Kiela, 2018), we report the Spearman correlation between the distance and human assessments.

The goal of the STS task is to assess how well the representations capture the semantic similarity of sentences as perceived by humans. As in the image retrieval task, we do not fine-tune the representations for the similarity task and report the Spearman correlation between the cosine distances of the representations and the ground-truth similarities.
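A minimal sketch of this evaluation, with toy vectors and invented gold scores in place of the SemEval data:

```python
# STS evaluation sketch: cosine similarity between fixed sentence vectors,
# rank-correlated with human judgements. All data here are random stand-ins.
import numpy as np
from scipy.spatial.distance import cosine
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
pairs_a = rng.normal(size=(50, 16))        # first sentence of each pair
pairs_b = rng.normal(size=(50, 16))        # second sentence of each pair
human_scores = rng.uniform(0, 5, size=50)  # stand-in for gold similarities

# scipy's `cosine` is a distance, 1 - cos(u, v); similarity = 1 - distance.
sims = [1.0 - cosine(a, b) for a, b in zip(pairs_a, pairs_b)]
rho, _ = spearmanr(sims, human_scores)
print("Spearman rho:", rho)
```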

Distance Correlation.

Distance correlation (DC) is a measure of dependence between any two paired vectors of arbitrary dimensions (Székely et al., 2007). Given two paired random vectors $X \in \mathbb{R}^p$ and $Y \in \mathbb{R}^q$, suppose that $f_X$ and $f_Y$ are the individual characteristic functions and $f_{X,Y}$ is the joint characteristic function of the two vectors. The distance covariance $\mathrm{dCov}(X, Y)$ between $X$ and $Y$ with finite first moments is the non-negative number given by:

$$\mathrm{dCov}^2(X, Y) = \frac{1}{c_p c_q} \int_{\mathbb{R}^{p+q}} \frac{\lvert f_{X,Y}(t, s) - f_X(t)\, f_Y(s) \rvert^2}{\lvert t \rvert_p^{1+p} \, \lvert s \rvert_q^{1+q}} \, dt \, ds,$$

where $c_d = \pi^{(1+d)/2} / \Gamma\!\left((1+d)/2\right)$; $p$ and $q$ are the dimensionalities of $X$ and $Y$, respectively. The distance correlation (DC) is then defined as:

$$\mathrm{dCor}(X, Y) = \frac{\mathrm{dCov}(X, Y)}{\sqrt{\mathrm{dVar}(X)\, \mathrm{dVar}(Y)}},$$

where $\mathrm{dVar}(X) = \mathrm{dCov}(X, X)$, and $\mathrm{dCor}(X, Y) = 0$ whenever the denominator is zero.

A detailed description of the DC is beyond the scope of this paper, but we refer the reader to Székely et al. (2007) for a thorough analysis.

Our use of DC is motivated by the fact that DC quantifies dependence: in particular, it equals zero exactly when the two vectors are mutually independent. Moreover, DC captures both linear and non-linear associations between two vectors. We use DC to measure the degree of correlation between different representations; we are especially interested in the degree to which two independently learned representations are correlated.
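In practice the sample distance correlation is computed from double-centered pairwise-distance matrices rather than characteristic functions. A self-contained sketch (this is the standard estimator of Székely et al. (2007), not the paper's own code):

```python
# Sample distance correlation via double-centered distance matrices.
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation (Székely et al., 2007).

    Rows of x and y are paired observations; columns may have
    different dimensionalities."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    if y.ndim == 1:
        y = y[:, None]
    assert x.shape[0] == y.shape[0], "x and y must pair the same samples"

    def centered_dists(z):
        # Pairwise Euclidean distances, double-centered (row, column, grand mean).
        d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
        return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()

    a, b = centered_dists(x), centered_dists(y)
    dcov2_xy = (a * b).mean()   # squared sample distance covariance
    dcov2_xx = (a * a).mean()   # squared sample distance variance of x
    dcov2_yy = (b * b).mean()
    if dcov2_xx * dcov2_yy == 0.0:
        return 0.0
    return float(np.sqrt(dcov2_xy / np.sqrt(dcov2_xx * dcov2_yy)))

rng = np.random.default_rng(0)
u = rng.normal(size=(100, 4))
# Non-linear but fully dependent pair: DC is clearly positive, whereas
# Pearson correlation of u and u**2 would be near zero.
print(distance_correlation(u, u ** 2))
```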

3 Experiments

We examine representations for four types of models: a) LMs; b) image representation prediction models (Imaginet); c) textual MT models; and d) multimodal MT models. For each task, we train models based on RNNs and on the Transformer architecture. In addition, we use training datasets of different sizes. All models were trained with Neural Monkey (Helcl and Libovický, 2017b).

3.1 Models

Language Models.

We trained an RNN LM with a single GRU layer (Cho et al., 2014) of 1,000 dimensions and embeddings of 500 dimensions. The Transformer LM (Vaswani et al., 2017) has model dimension 512, 6 layers, 8 attention heads, and a hidden layer size of 4,096.


Imaginet.

The Imaginet models (Chrupała et al., 2015) predict an image representation given a textual description of the image. The representation is trained only via its grounding in the image representation.

We use a bidirectional RNN encoder with the same hyperparameters as the aforementioned LM. The Transformer based Imaginet uses the same hyperparameters as the Transformer based LM. The states of the encoder are mean-pooled and projected with a hidden layer of 4,096 units and ReLU non-linearity to a 2,048-dimensional vector corresponding to the image representation from ResNet (He et al., 2016). For a fair comparison, we use the representation before the final non-linear projection.
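The prediction head described above can be sketched as follows. The weights here are random stand-ins (a real model learns them by minimizing the distance to the image vector); only the layer sizes follow the text:

```python
# Sketch of the Imaginet prediction head: mean-pooled encoder states are
# projected through a 4,096-unit ReLU layer to a 2,048-dimensional vector
# matching the ResNet image features. Weights are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
enc_dim, hidden_dim, img_dim = 1000, 4096, 2048

W1 = rng.normal(scale=0.01, size=(enc_dim, hidden_dim))
b1 = np.zeros(hidden_dim)
W2 = rng.normal(scale=0.01, size=(hidden_dim, img_dim))
b2 = np.zeros(img_dim)

def predict_image_vector(encoder_states):
    """encoder_states: (seq_len, enc_dim) encoder outputs for one sentence."""
    pooled = encoder_states.mean(axis=0)        # mean-pooling over time
    hidden = np.maximum(0.0, pooled @ W1 + b1)  # 4,096-unit ReLU layer
    return hidden @ W2 + b2                     # 2,048-d image prediction

states = rng.normal(size=(12, enc_dim))         # a 12-token sentence
print(predict_image_vector(states).shape)       # (2048,)
```

Note that, as stated above, the probing experiments use the representation before this final projection, i.e., the pooled encoder states themselves.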

For completeness, we also compare the LMs with ELMo (Peters et al., 2018), a representation based on a deep RNN LM with character-based embeddings pre-trained on a large corpus of 30 million sentences, and with BERT (Devlin et al., 2018), a Transformer based sentence representation similar to our Transformer based LM. We note, however, that BERT is trained with a significantly different procedure than regular LMs.

Textual MT models.

We trained the attentive RNN based seq2seq model (Bahdanau et al., 2014) with the same hyperparameters as the RNN Imaginet model, and with the conditional GRU (Firat and Cho, 2016) as the decoder. With the Transformer architecture, we used the same hyperparameters as for the Imaginet models.

Besides the text-only models, we trained Imagination models (Elliott and Kádár, 2017) that combine translation with the Imaginet models in a multi-task setup. The model is trained to generate a sentence in the target language and to predict the image representation at the same time.

With multi-task learning, the model takes advantage of large parallel data without images and of monolingual image captioning data at the same time. Presumably, the model achieves superior translation quality by learning a better source sentence representation. At inference time, the model only requires the textual input.

Multimodal MT models.

For both the RNN and Transformer architectures, we used the same hyperparameters as for the textual models. As in the previous models, we use the last convolutional layer of ResNet as the image representation.

In the RNN setup, we experiment with decoder initialization with the image representation (Caglayan et al., 2017; Calixto and Liu, 2017) and with a doubly attentive decoder with three different attention combination strategies (Libovický and Helcl, 2017). First (attention concatenation), we concatenate context vectors computed independently over the image representation and the source sentence; second (flat attention combination), we compute a joint distribution over the image convolutional maps and the source encoder states; third (hierarchical attention combination), we compute the context vectors independently and combine them hierarchically using another attention mechanism.

In the Transformer setup, the multimodal models use doubly attentive decoders (Libovický et al., 2018). We experiment with four setups: serial, parallel, flat, and hierarchical input combination. The first two are direct extensions of the Transformer architecture, adding more sublayers to the decoder; the latter two are modifications of the attention strategies used in the RNN setup.

3.2 Datasets

Training data.

To evaluate how the representation quality depends on the amount of training data, we train our models on datasets of different sizes. The smallest dataset, used for all types of experiments, is Multi30k (Elliott et al., 2016), which consists of only 29k training images with English captions and their translations into German, French, and Czech.

For the monolingual experiments (LM and image representation prediction), we further use the English captions from the Flickr30k dataset (Plummer et al., 2015), which contains 5 captions for each image, 145k in total. The largest monolingual dataset we work with is a concatenation of Flickr30k and the COCO dataset (Lin et al., 2014), with 414k descriptions of 82k images.

For textual MT, where parallel data are needed, we also consider an unconstrained setup with additional data harvested from parallel and monolingual corpora (Helcl and Libovický, 2017a; Helcl et al., 2018) combined with the EU Bookshop corpus (Tiedemann, 2012), totaling 200M words.

Multimodal MT models are trained on the Multi30k data only.

Evaluation data.

We fit the CCA on the 29k image-sentence pairs of the training portion of the Multi30k and evaluate on the 1k pairs from the test set.

For STS, we evaluate the representations on the SemEval 2016 dataset (Agirre et al., 2016). The test set consists of 1,186 sentence pairs collected from datasets of newspaper headlines, machine translation post-editing, plagiarism detection, and question-to-question and answer-to-answer matching on Stack Exchange data. Each sentence pair is annotated with a similarity value.

Language models             ppl.   Img.    STS

RNN
  Multi30k                 12.10   16.6   .267
  Flickr30k                11.80   22.4   .340
  Flickr30k + COCO         11.80   23.0   .378
Transformer
  Multi30k                 12.42    8.9   .256
  Flickr30k                11.87   17.6   .283
  Flickr30k + COCO         11.69   21.0   .303
Pretrained
  ELMo                         —   28.4   .631
  BERT                         —   22.4   .624

Imaginet                    R@10   Img.    STS

RNN
  Multi30k                  29.5   24.4   .401
  Flickr30k                 37.8   26.3   .483
  Flickr30k + COCO          39.4   25.4   .501
Transformer
  Multi30k                  25.5   22.1   .338
  Flickr30k                 36.6   29.5   .436
  Flickr30k + COCO          38.4   28.0   .451

Textual MT                  BLEU   Img.    STS

RNN
  Textual                   36.7   22.5   .527
  Textual U                 38.7   21.8   .621
  Imagination               36.8   20.1   .550
  Imagination U             38.2   27.4   .622
Transformer
  Textual                   38.3   18.8   .374
  Textual U                 40.4   21.3   .509
  Imagination               39.2   26.5   .433
  Imagination U             42.6   31.9   .512

Multimodal MT               BLEU   Img.    STS

RNN
  Decoder init.             36.9   16.6   .536
  Att. concatenation        35.7   11.4   .429
  Flat att. comb.           34.6   14.6   .487
  Hierar. att. comb.        37.6   16.7   .553
Transformer
  Serial att. comb.         38.7   15.8   .383
  Parallel att. comb.       38.6   16.8   .398
  Flat att. comb.           37.1   16.6   .385
  Hierar. att. comb.        38.5   14.3   .346

Table 1: Recall at 10 for image retrieval ('Img.') and Spearman correlation on the sentence similarity task ('STS') for representations extracted from the models. 'U' denotes use of the unconstrained dataset. The first column contains a task-specific metric on the Multi30k test set: LM perplexity, image recall at 10, and BLEU score, respectively.

4 Results & Discussion

Figure 1: Distance correlation of representations from pairs of selected models.

We present the image retrieval and STS results along with the task-specific metrics in Table 1. We observe that on moderately sized datasets, models conditioned on the target language and the visual modality provide a stronger training signal for learning sentence representations than models trained with a simple language modeling objective.

The unconstrained variants of the RNN MT models obtain a similar performance on STS to the ELMo and BERT models, even though they were trained on orders of magnitude fewer samples.

We also observe that while the Transformer based models achieve superior translation quality on the MT tasks, the results on STS suggest that the RNN models obtain semantically richer representations. The textual RNN translation models perform better on image retrieval than the Transformer models; the opposite holds for the Imagination models, which are explicitly trained to predict the image representation and where the Transformer variants outperform their RNN counterparts. Given these consistent observations, we posit that the Transformer based models, while achieving good performance on the tasks they are trained for, seem to ignore image information.

The slight difference between the image retrieval performance of the Imaginet and Imagination models suggests that training the representation using the vision signal and the target language signal is complementary.

We also evaluated the STS performance of the representations after the CCA projections. The Spearman correlation is consistently worse.

The encoders of the multimodal MT models, which explicitly use the visual input in the decoder, achieve significantly lower image retrieval scores. This suggests that the textual encoder ignores information about the visual aspects of meaning, as the decoder has full access to this information through the explicit conditioning on the image representation. This observation is in line with the conclusions of adversarial evaluation (Elliott, 2018; Libovický et al., 2018).

Figure 2: Plot of dependence of the BLEU score on the Spearman correlation on the STS dataset.

Our experiments also indicate that the performance on STS is highly correlated with translation quality for both the RNN based and the Transformer based models (see Figure 2), which contrasts with the findings of Cífka and Bojar (2018), who measured the correlation of BLEU and STS under similar conditions. In addition, we observe that the Transformers perform significantly worse on STS than their RNN counterparts. The translation quality also appears to be highly correlated with the amount of available training data and with the image retrieval abilities of the representations (see Table 2).

Correlation of BLEU and … Trans. RNN
Image retrieval R@10 .825 .700
STS performance .852 .873
Training data size .867 .724
Table 2: Pearson correlation of MMT performance and representation properties.
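The coefficients in Table 2 are plain Pearson correlations between per-model BLEU scores and representation metrics. A minimal sketch using the RNN multimodal MT rows of Table 1 as example points (not necessarily the exact set of points used for Table 2):

```python
# Pearson correlation of BLEU against STS performance, illustrated with the
# RNN multimodal MT rows from Table 1.
from scipy.stats import pearsonr

bleu = [36.9, 35.7, 34.6, 37.6]   # decoder init., att. concat., flat, hierar.
sts = [.536, .429, .487, .553]    # matching STS Spearman correlations

r, p_value = pearsonr(bleu, sts)
print(f"Pearson r = {r:.3f}")
```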

The results of DC for selected models are shown in Figure 1. The DC between the image and the sentence representations is proportional to the image retrieval score; the image representations also have the lowest distance correlation overall, which results in poorer CCA based projections. Sentence representations seem to be more similar across tasks than across architectures. Most notable is the mutual similarity of the representations from all MT systems, regardless of the architecture and the modality setup.

5 Conclusions

We conducted a set of controlled and thorough experiments to assess the representational qualities of monomodal and multimodal sequential models with the predominant architectures. Our experiments show that grounding, either in the visual modality or in another language, and especially their combination in the Imagination models, results in better representations than LMs trained on datasets of similar sizes. We also showed that the translation quality of the MT models is highly correlated both with the ability of the models to retain image information and with the semantic properties of the representations.

The RNN models tend to perform better on both the semantic similarity and image retrieval tasks, although they do not reach the same translation quality. We hypothesize that this is because of differences in the architecture that allow the Transformer network to directly access information that the RNN needs to pass through its hidden states.


Acknowledgments

Jindřich received funding from the Czech Science Foundation, grant no. 18-02196S.