Visually Grounded Word Embeddings and Richer Visual Features for Improving Multimodal Neural Machine Translation

by   Jean-Benoit Delbrouck, et al.

In Multimodal Neural Machine Translation (MNMT), a neural model generates a translated sentence that describes an image, given the image itself and one source description in English. This is known as the multimodal image caption translation task. The images are processed with a Convolutional Neural Network (CNN) to extract visual features exploitable by the translation model. So far, the CNNs used have been pre-trained on object detection and localization tasks. We hypothesize that richer architectures, such as dense captioning models, may be more suitable for MNMT and could lead to improved translations. We extend this intuition to the word embeddings, where we compute both linguistic and visual representations for our corpus vocabulary. We combine and compare different confi




1 Introduction

Neural networks have shown great performance on the Machine Translation (MT) task. The encoder-decoder framework [1] has since been widely adopted. An attention mechanism was introduced by Bahdanau et al. [2] to learn to focus on different parts of the input sentence while decoding. Other modalities, like images, can make use of such attention mechanisms. Previous work [3] has shown that they are able to learn to attend to the salient parts of an image when generating text captions.

Integrating multimodal information efficiently remains a challenge. It requires combining diverse modality vector representations. A few attempts [4, 5, 6, 7, 8] were made during the WMT 2016 Multimodal Machine Translation evaluation campaign. These initial efforts did not convincingly demonstrate that visual context can improve translation quality. Since then, a few improvements have been made: [9] proposed a doubly-attentive decoder that outperformed all previous baselines with less data and without re-scoring, and [10] tried multiple attention models and image attention optimizations such as the gating [3] and pre-attention [11] mechanisms. Recently, [12] introduced a model where visually grounded representations are learned.

In this paper, our aim is to propose a first empirical investigation on improving MNMT by using improved visual and word representations. More specifically, we believe that visual and word representations obtained through models pre-trained on large data sets should bring further improvement. Most importantly, we want to leverage models that provide a closer link between image understanding and language understanding. For extracting image modality vector representations, we make use of a model trained on a dense captioning task, namely DenseCap [13]. Compared to models trained on object recognition tasks (such as ImageNet [14], as used in previous MNMT proposals), the hope is that the representation contains richer information, also encoding object attributes and the relationships that matter for the linguistic description of the images. For extracting the vector representations of the word modality, we make use of word vectors obtained from large-scale text corpora, but we also use visual representations of the referents of those words, following a recent paradigm of "imagined" visual representations. The hope is that these visually grounded word representations will facilitate the integration of both modalities during the decoding process, hence improving translation results.

Our paper is structured as follows. In section 2, we briefly describe our NMT model as well as the conditional GRU activation used in the decoder. We also explain how multiple modalities can be implemented within this framework. In sections 3 and 4, we detail how our visually grounded word embeddings and visual features are created. Finally, we report and analyze our results in section 6.

2 Model

2.1 Text-based NMT

We describe the attention-based NMT model introduced by [2] in this section. Given a source sentence $X = (x_1, x_2, \dots, x_M)$, the neural network directly models the conditional probability $p(Y \mid X)$ of its translation $Y = (y_1, y_2, \dots, y_N)$. The network consists of one encoder and one decoder with one attention mechanism. Each source word $x_i$ and target word $y_t$ is a column index of the embedding matrices $E_X$ and $E_Y$ respectively. The encoder is a bi-directional RNN with Gated Recurrent Unit (GRU) layers [15, 16], where a forward RNN reads the input sequence as it is ordered (from $x_1$ to $x_M$) and calculates a sequence of forward hidden states $(\overrightarrow{h}_1, \dots, \overrightarrow{h}_M)$. A backward RNN reads the sequence in the reverse order (from $x_M$ to $x_1$), resulting in a sequence of backward hidden states $(\overleftarrow{h}_1, \dots, \overleftarrow{h}_M)$. We obtain an annotation $h_i$ for each word $x_i$ by concatenating the forward and backward hidden states, $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$. Each annotation $h_i$ contains the summaries of both the preceding words and the following words. The representation $C$ for each source sentence is the set of annotations $C = \{h_1, h_2, \dots, h_M\}$.

The decoder is an RNN that uses a conditional GRU (cGRU) with an attention mechanism to generate a translated word $y_t$. More precisely, the conditional GRU has three main components computed at each time step $t$ of the decoder:

  • REC1 computes a hidden state proposal $s'_t$ based on the previous hidden state $s_{t-1}$ and the previously emitted word $y_{t-1}$;

  • $f_{att}$ (called ATT in the aforementioned paper) is an attention mechanism over the hidden states of the encoder and computes a time-dependent context vector $c_t$ using the annotation set $C$ and the hidden state proposal $s'_t$;

  • REC2 computes the final hidden state $s_t$ using the hidden state proposal $s'_t$ and the context vector $c_t$.

Both $s_t$ and $c_t$ are further used to decode the next symbol. We use a deep output layer [17] to compute a vocabulary-sized vector $o_t$:

$$o_t = L_o \tanh(L_s s_t + L_c c_t + L_w E_Y[y_{t-1}]) \qquad (1)$$

where $L_o$, $L_s$, $L_c$, $L_w$ are model parameters. We can parameterize the probability of decoding each word $y_t$ as:

$$p(y_t \mid y_{t-1}, s_t, c_t) = \mathrm{softmax}(o_t) \qquad (2)$$

The initial state of the decoder $s_0$ at time step $t = 0$ is initialized by the following equation:

$$s_0 = f_{init}\Big(\frac{1}{M} \sum_{i=1}^{M} h_i\Big) \qquad (3)$$

where $f_{init}$ is a feed-forward network with one hidden layer.
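The deep output layer and the softmax readout described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the parameter names `L_o`, `L_s`, `L_c`, `L_w`, the helper `deep_output`, the toy dimensions and the random weights are all hypothetical and do not come from the Nematus code base.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def deep_output(s_t, c_t, y_prev_emb, L_s, L_c, L_w, L_o):
    """Vocabulary-sized scores from the decoder state, the context vector
    and the embedding of the previously emitted word."""
    hidden = np.tanh(L_s @ s_t + L_c @ c_t + L_w @ y_prev_emb)
    return L_o @ hidden

rng = np.random.default_rng(0)
state_dim, ctx_dim, emb_dim, out_dim, vocab = 8, 16, 6, 10, 20  # toy sizes (assumed)

s_t = rng.normal(size=state_dim)          # final hidden state of the cGRU
c_t = rng.normal(size=ctx_dim)            # text context vector
y_prev_emb = rng.normal(size=emb_dim)     # embedding of the previous target word

L_s = rng.normal(size=(out_dim, state_dim))
L_c = rng.normal(size=(out_dim, ctx_dim))
L_w = rng.normal(size=(out_dim, emb_dim))
L_o = rng.normal(size=(vocab, out_dim))

o_t = deep_output(s_t, c_t, y_prev_emb, L_s, L_c, L_w, L_o)
p_t = softmax(o_t)                        # distribution over the target vocabulary
```

In a multimodal variant, an image context vector would simply contribute one more linear term inside the `tanh`.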

We use the soft attention mechanism for the $f_{att}$ component. Soft attention was first used for syntactic constituency parsing by [18] and has been widely used for translation tasks ever since. The idea of the soft attentional model is to consider all the annotations when deriving the context vector $c_t$. It consists of a single feed-forward network used to compute an expected alignment $e_{t,i}$ between each text annotation $h_i$ and the target word to be emitted at the current time step $t$. The inputs are the annotations and the intermediate representation of REC1, $s'_t$:

$$e_{t,i} = U_a^\top \tanh(W_a s'_t + W_h h_i) \qquad (4)$$

$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{M} \exp(e_{t,j})} \qquad (5)$$

where $\alpha_{t,i}$ is the normalized alignment between each source annotation vector $h_i$ and the word to be emitted at time step $t$. In the above expressions, $U_a$, $W_a$ and $W_h$ are trained parameters. Finally, the time-dependent context vector $c_t$ of the modality is computed as a weighted sum over the annotation vectors (equation 6):

$$c_t = \sum_{i=1}^{M} \alpha_{t,i} h_i \qquad (6)$$
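As a minimal sketch of the soft attention step, the NumPy code below scores every annotation against the hidden state proposal with a single feed-forward layer, normalizes the scores with a softmax and returns the weighted sum. The function name `soft_attention`, the parameter names `W_a`, `W_h`, `U_a` and the toy dimensions are assumptions made for this illustration.

```python
import numpy as np

def soft_attention(annotations, s_prop, W_a, W_h, U_a):
    """annotations: (M, d_h) encoder annotations; s_prop: (d_s,) state proposal.
    Returns the alignment weights (M,) and the context vector (d_h,)."""
    # One feed-forward layer scores each annotation against the state proposal.
    scores = np.tanh(annotations @ W_h.T + W_a @ s_prop) @ U_a    # (M,)
    # Softmax over source positions (shifted for numerical stability).
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Context vector: weighted sum of the annotations.
    context = weights @ annotations
    return weights, context

rng = np.random.default_rng(0)
M, d_h, d_s, d_att = 7, 12, 8, 10          # toy sizes (assumed)
annotations = rng.normal(size=(M, d_h))
s_prop = rng.normal(size=d_s)
W_a = rng.normal(size=(d_att, d_s))
W_h = rng.normal(size=(d_att, d_h))
U_a = rng.normal(size=d_att)

alpha_t, c_t = soft_attention(annotations, s_prop, W_a, W_h, U_a)
```

Because the weights are non-negative and sum to one, each component of the context vector lies between the smallest and largest value of that component across the annotations.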


2.2 Multimodal NMT (MNMT)

In multimodal NMT, the second modality is usually an image, for which feature maps are computed using a Convolutional Neural Network (CNN). The annotations are spatial features (i.e. each annotation represents features for a specific region in the image). More formally, given a set of image modality annotations $I = \{a_1, a_2, \dots, a_L\}$, we compute an image context vector $i_t$ based on the same intermediate hidden state proposal $s'_t$:

$$i_t = f'_{att}(I, s'_t)$$

This new time-dependent context vector is an additional input to a modified version of REC2, which now computes the final hidden state $s_t$ using the intermediate hidden state proposal $s'_t$ and both time-dependent context vectors $c_t$ (for text) and $i_t$ (for image). In addition, $i_t$ is weighted with the gating scalar mechanism as seen in [3].

The probabilities for the next target word (from equation 1) also take into account the new context vector $i_t$:

$$o_t = L_o \tanh(L_s s_t + L_c c_t + L_i i_t + L_w E_Y[y_{t-1}])$$

where $L_i$ is a new trainable parameter.
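To make the image branch more concrete, here is a hedged NumPy sketch: a spatial feature map is flattened into one annotation per region, attended with the same kind of feed-forward scoring as the text attention, and the resulting context vector is scaled by a scalar gate. The exact form of the gate is an assumption (a sigmoid of a linear function of the state proposal, in the spirit of [3]); the function name, parameter names and sizes are hypothetical as well.

```python
import numpy as np

def gated_image_context(feature_map, s_prop, W_a, W_i, U_a, w_beta, b_beta):
    """Attend over spatial CNN features and scale the result by a scalar gate."""
    # One annotation per spatial region: (H, W, D) -> (H*W, D).
    annotations = feature_map.reshape(-1, feature_map.shape[-1])
    scores = np.tanh(annotations @ W_i.T + W_a @ s_prop) @ U_a    # (H*W,)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    i_t = alpha @ annotations                                     # (D,)
    # Assumed gate form: sigmoid of a linear function of the state proposal.
    beta = 1.0 / (1.0 + np.exp(-(w_beta @ s_prop + b_beta)))
    return beta * i_t, alpha, beta

rng = np.random.default_rng(0)
height, width, depth, d_s, d_att = 14, 14, 32, 8, 16   # toy sizes (assumed)
feature_map = rng.normal(size=(height, width, depth))
s_prop = rng.normal(size=d_s)

gated_i_t, alpha, beta = gated_image_context(
    feature_map, s_prop,
    W_a=rng.normal(size=(d_att, d_s)),
    W_i=rng.normal(size=(d_att, depth)),
    U_a=rng.normal(size=d_att),
    w_beta=rng.normal(size=d_s),
    b_beta=0.0,
)
```

The gate lets the decoder shrink the image contribution at time steps where the visual context is of little use for the next word.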

3 Improved Word Embeddings

In previous work on MNMT, word embeddings are usually trained along with the model: both matrices $E_X$ and $E_Y$ are considered trained model parameters. This approach cannot exploit the large-scale text corpora that may be available in the source language and that could be leveraged to obtain useful distributed semantic representations of the words (such as Word2Vec [19] or GloVe [20]). Here, we make use of GloVe to build a multimodal representation, textual and visual, for our whole source vocabulary. To do so, we use a method that learns a language-to-vision mapping as described in [21]. The learned model outputs visual predictions of a word given its semantic representation.

3.1 Language-to-vision mapping

Concretely, we consider two embedding spaces: a linguistic space $L \subset \mathbb{R}^{d_l}$ and a visual space $V \subset \mathbb{R}^{d_v}$, where $d_l$ and $d_v$ are the sizes of the text and visual representations respectively. For a given dataset of words $W$, each word $w \in W$ has a linguistic representation $l_w \in L$ and a visual representation $v_w \in V$. The aim is to learn a mapping function $f: L \rightarrow V$ such that the prediction (or imagined representation) $f(l_w)$ is close to the actual visual vector $v_w$. A training example is thus a pair $(l_w, v_w)$ and the dataset is composed of $n$ examples.

3.2 Visually Grounded Word Embeddings

ImageNet is used as the source of visual information. ImageNet covers a total of 21,841 WordNet synsets and has 14,197,122 images. For the experiment, only synsets with more than 50 images are kept, and an upper bound of 500 images per synset is used to reduce computation time. With this restriction, 9,251 unique words are covered. The training set is thus composed of $n$ = 9,251 examples. For each of these words, we use a pre-trained VGG-m-128 CNN model [22] to extract visual features from each image. We take the 128-dimensional activation of the last layer. The visual representation is computed as an element-wise average of the feature vectors from the different images picturing the object the word refers to. As previously mentioned, the textual representation is obtained with the GloVe word embedding algorithm. We use the GloVe model pre-trained on the Common Crawl corpus, which consists of 840B tokens and a 2.2M-word vocabulary. The mapping function $f$ consists of a simple perceptron composed of a $d_l$-dimensional input layer (the size of the linguistic vectors) and a linear output layer with $d_v$ units (the size of the visual vectors).
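The two steps of this section, averaging image features into one visual vector per word and fitting a linear language-to-vision mapping, can be sketched as follows. This is an illustrative toy under explicit assumptions: the data are synthetic stand-ins for GloVe and CNN vectors, the dimensions are much smaller than the 300 and 128 used in the text, and plain full-batch gradient descent without dropout replaces the training setup of [21].

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, d_l, d_v = 200, 20, 8   # toy sizes; the text uses 9,251 words, 300 and 128

# Visual vector of one word: element-wise mean over the features of its images.
image_features = rng.normal(size=(50, d_v))   # 50 hypothetical images of one concept
v_word = image_features.mean(axis=0)

# Synthetic (linguistic, visual) training pairs.
L = rng.normal(size=(n_words, d_l))
V = L @ (0.1 * rng.normal(size=(d_l, d_v))) + 0.01 * rng.normal(size=(n_words, d_v))

def mse(W, b):
    """Mean squared error between imagined and actual visual vectors."""
    return float(np.mean((L @ W + b - V) ** 2))

# Linear perceptron f(l) = l W + b trained by gradient descent on the squared error.
W, b = np.zeros((d_l, d_v)), np.zeros(d_v)
initial_loss = mse(W, b)
for _ in range(200):
    err = L @ W + b - V                   # (n_words, d_v) prediction error
    W -= 0.1 * (L.T @ err) / n_words      # gradient step (up to a constant factor)
    b -= 0.1 * err.mean(axis=0)
final_loss = mse(W, b)
```

Once trained, the mapping can produce an imagined visual vector for any word that has a linguistic vector, including words with no images in the visual source.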

3.3 Imagined multimodal embeddings

We use the pre-trained model made available by [21]. Their training is done with a mean squared error (MSE) loss function and a stochastic gradient descent optimizer. A learning rate of 0.1 and a dropout rate of 0.1 are chosen, and training runs for 175 epochs. GloVe vectors are of size $d_l$ = 300, and a low-dimensional visual representation of size $d_v$ = 128 is picked to reduce the number of parameters and thus the risk of overfitting. The multimodal representation of a word $w$ is built by concatenating the L2-normalized imagined representation $f(l_w)$ with the textual representation $l_w$. Hence, the multimodal representation is of size 428. We compute it using this model for every word in our source vocabulary.
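The construction of a 428-dimensional multimodal embedding can be sketched as below. The mapping weights `W` and `b` are random placeholders standing in for the pre-trained language-to-vision model, and the input vector is a random stand-in for a GloVe vector; only the shapes and the normalize-then-concatenate logic follow the text.

```python
import numpy as np

def multimodal_embedding(text_vec, W, b):
    """Concatenate the L2-normalized imagined visual vector with the text vector."""
    imagined = text_vec @ W + b                  # predicted visual representation
    norm = np.linalg.norm(imagined)
    if norm > 0:
        imagined = imagined / norm               # L2 normalization
    return np.concatenate([imagined, text_vec])  # visual part first, then textual part

rng = np.random.default_rng(0)
d_l, d_v = 300, 128
W = rng.normal(size=(d_l, d_v))    # placeholder for the learned mapping
b = np.zeros(d_v)
glove_vec = rng.normal(size=d_l)   # placeholder for a 300-d GloVe vector

embedding = multimodal_embedding(glove_vec, W, b)
```

Stacking these vectors for every source word yields an embedding matrix that can either initialize the encoder embeddings or be kept frozen during training.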

| Model | BLEU | METEOR | TER |
|---|---|---|---|
| Previous work | | | |
| [9] ResNet-50 + along (620) | 36.50 | 55.0 | 43.7 |
| [12] GoogLeNet v3 + along (620) | 36.8 ± 0.8 | 55.8 ± 0.4 | - |
| [12] GoogLeNet v3 + along (620) + COCO + NC | 37.8 ± 0.7 | 57.1 ± 0.2 | - |
| (B1) ResNet-50 + along (428) | 36.27 | 53.9 | 43.6 |
| (B2) DenseCap + along (428) | 37.78 (+1.51) | 54.6 (+0.7) | 42.3 (-0.3) |
| Linguistic word embeddings | | | |
| (L1) ResNet-50 + GloVe (300) | 37.40 (+1.13) | 55.0 (+1.1) | 42.1 (-0.5) |
| (L2) DenseCap + GloVe (300) | 37.95 (+1.68) | 55.7 (+1.8) | 42.1 (-0.5) |
| Imagined multimodal word embeddings | | | |
| (M1) ResNet-50 + GloVe + Visual (428, fixed) | 35.52 (-0.75) | 53.7 (-0.2) | 43.3 (+0.7) |
| (M2) ResNet-50 + GloVe + Visual (428) | 37.51 (+1.24) | 55.2 (+1.3) | 42.1 (-0.5) |
| (M3) DenseCap + GloVe + Visual (428) | 38.20 (+1.93) | 55.7 (+1.8) | 41.9 (-0.7) |

Table 1: Test scores on the test triples of the Multi30K dataset. "Along" means that embeddings are initialized with a Gaussian and trained along with the model. "Fixed" means that the embeddings are frozen during the whole training. Embedding sizes are given in brackets at the end of each model description. Differences in brackets are computed with respect to model B1.

4 Visual Features

So far in MNMT, the spatial visual features of the images have been extracted with a 16- or 19-layer version of VGGNet, or with a Deep Residual Network [23] pre-trained on ImageNet for an object detection and localization task. As mentioned in [JBempiricalEMNLP], such features may not be suited for the translation of complex captions, which involve objects but also their attributes and relationships (as shown in Figure 1). In this work, we focus on using features extracted from a model pre-trained for a dense captioning task, namely DenseCap pre-trained on Visual Genome. Its architecture is composed of a Convolutional Network, a novel dense localization layer, and a Recurrent Neural Network language model that generates the label sequences. The CNN is pre-trained on ImageNet and fine-tuned during the training (except for the first four convolutional layers). We extract the features at the last convolutional layer. Due to the high sparsity of these features, we apply an

In this experiment, we compare two kinds of image annotations used by our decoder: features extracted from a ResNet-50 pre-trained on ImageNet (at its res4f layer) and features extracted from DenseCap as described above.

Figure 1: Left: Object localization and detection. Right: Dense captioning

5 Dataset and model settings

For these experiments on Multimodal Machine Translation, we used the Multi30K dataset [25], which is an extended version of the Flickr30K Entities dataset. For each image, one of the English descriptions was selected and manually translated into German by a professional translator. As training and development data, 29,000 and 1,014 triples are used respectively. A test set of 1,000 triples is used for metrics evaluation.

All our models are built on top of the Nematus framework [26]. The encoder is a bidirectional RNN with GRU, with one 1024D single-layer forward RNN and one 1024D single-layer backward RNN. Non-recurrent matrices are initialized by sampling from a Gaussian distribution, recurrent matrices are random orthogonal and bias vectors are all initialized to zero. The word embedding matrices $E_X$ and $E_Y$ are either trained along with the model and initialized accordingly, or pre-trained as described in section 3. The embedding size depends on the experiment and is explicitly mentioned in the score table (Table 1) for every model. We apply dropout with a probability of 0.3 on the embeddings and on the hidden states of the bidirectional RNN in the encoder as well as in the decoder. In the decoder, we also apply dropout on the text annotations, the image features, both modality context vectors and all components of the deep output layer before the readout operation. Dropout is applied using the same mask at all time steps [27].
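Applying dropout with one mask shared across time steps can be illustrated with the short NumPy sketch below. The function name `shared_mask_dropout`, the inverted-dropout rescaling and the toy input are assumptions for illustration; only the drop probability of 0.3 and the idea of reusing a single mask at every time step come from the text.

```python
import numpy as np

def shared_mask_dropout(states, drop_prob, rng):
    """states: (T, d) sequence of vectors. A single mask of shape (d,) is sampled
    once and reused at every time step (inverted-dropout scaling, assumed here)."""
    keep_prob = 1.0 - drop_prob
    mask = (rng.random(states.shape[1]) < keep_prob) / keep_prob   # (d,)
    return states * mask, mask

rng = np.random.default_rng(0)
states = np.ones((5, 1024))        # 5 time steps of a 1024-d hidden state (toy input)
dropped, mask = shared_mask_dropout(states, drop_prob=0.3, rng=rng)
```

With a shared mask, a unit that is dropped at the first time step stays dropped for the whole sequence, in contrast to sampling a fresh mask at each step.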

We normalize and tokenize English and German descriptions using the Moses tokenizer scripts [28]. We use the byte pair encoding algorithm on the target train set to convert space-separated tokens into subwords [29], reducing the German vocabulary to 14957 words. Our source vocabulary, in English, is of size 11187. The visual features of the Flickr30K images are extracted with DenseCap or ResNet-50 in our experiments (as described in section 4).

All variants of our model were trained with ADADELTA [30], with mini-batches of 40 examples. We apply early stopping for model selection based on BLEU4: training is halted if no improvement on the development set is observed for more than 20 epochs. We use the metrics BLEU4 [31], METEOR [32] and TER [33] to evaluate the quality of our models' translations.
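The early stopping rule can be written down as a small, framework-independent sketch. The function name and the list of per-epoch development BLEU scores are hypothetical; in a real run the scores would come from evaluating the model after each epoch.

```python
def early_stopping(dev_bleu_per_epoch, patience=20):
    """Return (best_epoch, best_bleu, last_epoch_run). Training halts once more
    than `patience` epochs have passed without a new best development BLEU."""
    best_bleu, best_epoch, last_epoch = float("-inf"), -1, -1
    for epoch, bleu in enumerate(dev_bleu_per_epoch):
        last_epoch = epoch
        if bleu > best_bleu:
            best_bleu, best_epoch = bleu, epoch   # new best model: keep it
        elif epoch - best_epoch > patience:
            break                                 # no improvement for too long
    return best_epoch, best_bleu, last_epoch

# Hypothetical score curve: improves for three epochs, then plateaus.
scores = [30.0, 31.0, 35.0] + [34.0] * 30
best_epoch, best_bleu, last_epoch = early_stopping(scores, patience=20)
```

The model kept for testing is the one saved at `best_epoch`, not the one from the last epoch that was run.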

6 Results and Future work

We report our results in Table 1. We structure our analysis in three main parts. We start by discussing the effectiveness of the DenseCap features, followed by the impact of the multimodal word embeddings. Finally, we qualitatively compare the translations of some of our models to illustrate the effect of our choices more concretely.

Flickr30K Visual Features  We observe a noticeable improvement when using DenseCap visual features instead of ResNet-50. Between models B1 and B2, we obtain an improvement of +1.51 BLEU and +0.7 METEOR when embeddings are trained along with the model. With pre-trained GloVe embeddings, between systems L1 and L2, we notice an improvement of +0.55 BLEU and +0.7 METEOR. Lastly, with multimodal embeddings, model M3 scores +0.69 BLEU and +0.5 METEOR over M2. Overall, the results support our intuition that features from a rich dense captioning task improve translation quality, especially on the METEOR metric, which suggests that the attention model benefits from this change. The use of pre-trained embeddings narrows the gap between the two kinds of image features (i.e. the score differences for M2-M3 and L1-L2 are lower than for B1-B2). Moreover, it is interesting to note that the DenseCap features are half the size of the ResNet-50 features.

Word embeddings  Adding GloVe word embeddings clearly helps the model to translate better. Yet, the addition of the visual embeddings to the linguistic embeddings only brings a small improvement: +0.11 BLEU and +0.2 METEOR for ResNet-50 (L1-M2) and +0.25 BLEU and +0.0 METEOR for DenseCap (L2-M3). We can hypothesize two reasons. Firstly, the visual embeddings of size 128 may be too small to be really significant in rather large networks such as those presented here, and the model in section 3 might need some changes. Secondly, the visual information used during the training of the mapping function (the function that outputs our visual embeddings of size 128) is extracted with a VGG-128 network, whereas our models are trained using a ResNet-50 or DenseCap network for visual feature extraction. It is possible that too many model parameters are required to do the mapping between the two embedding spaces (the word embeddings and the visual features), which would impact the models' translation quality. On a side note, we tried to freeze the loaded embeddings during training (referred to as "fixed" in the score table), but this led to a degradation of -0.75 BLEU and -0.2 METEOR (B1-M1). We conclude that the models need to slightly change the representations of the words in order to generate the strongest textual and visual context vectors.

Translation comparison  We show some hand-picked examples from the test set where we noticed significant improvements between our models. We start by comparing the two baseline systems B1 and B2. We pick a sentence that involves a positional relationship between a man and a dog (Figure 2):

Source: A man is dancing with a dog between his legs .
Reference: Ein Mann tanzt mit einem Hund zwischen den Beinen .
B1: Ein Mann tanzt und ein Hund zwischen seinen Beinen .
B2: Ein Mann tanzt mit einem Hund zwischen den Beinen .

The sentence-level BLEU score is 100 for B2 (DenseCap) and 31.40 for B1 (ResNet-50). We now compare our two best models M2 and M3 on a more complex and descriptive sentence:

Source: A person in a red jacket with black pants holding rainbow ribbons .
Reference: Eine Person in einer roten Jacke und schwarzen Hosen hält Regenbogenbänder .
M2: Eine Person in roter Jacke mit schwarzen Hosen hält eine Znde in der Hand .
M3: Eine Person in einer roten Jacke und schwarzen Hosen hält einen Zebnde .

M3 and M2 respectively have a sentence-level BLEU score of 77.19 and 23.66.

Figure 2: Left: First sentence Right: Second sentence

In future work, it would be interesting to use the same CNN for all the model's components (multimodal word embeddings and visual features). The experiment could be re-attempted with DenseCap on all fronts. Another direction would be a more in-depth investigation of the attention model's behavior when using DenseCap. It is not clear whether the improved behavior learned by the model is brought by the word embeddings (and therefore GloVe and the VGG-128 of section 3) or by the attention model on the visual features (extracted with DenseCap).

7 Acknowledgements

This work was partly supported by the Chist-Era project IGLU with contribution from the Belgian Fonds de la Recherche Scientifique (FNRS), contract no. R.50.11.15.F, and by the FSO project VCYCLE with contribution from the Belgian Walloon Region, contract no. 1510501.