Since the first multimodal machine translation (MMT) shared task (specia-EtAl:2016:WMT) was released, the community has struggled to prove the effectiveness of images in the translation process. Most works (caglayan-EtAl:2016:WMT; huang-EtAl:2016:WMT; calixto-elliott-frank:2016:WMT; delbrouck-dupont:2017:EMNLP2017) naturally focused on using a soft attention mechanism (BahdanauCB14) on the convolutional features (also called attention maps) of an image, alongside a textual attention mechanism, because this approach has shown great success in image captioning (icml2015_xuc15). First attempts were relatively unsuccessful (i.e. they scored slightly lower than a strong monomodal baseline) and it was hard to identify the real reasons for these underwhelming results. The latest multimodal translation shared task (ElliottFrankBarraultBougaresSpecia2017) addressed this issue by releasing two new test sets, containing pictures from new Flickr groups and sentences with ambiguous verbs, so that the image is known to be able to play a disambiguating role in the translation process. At the same time, though, the monomodal baseline got stronger and stronger with new findings regarding recurrent network architectures, such as layer normalization (ba2016layer), making the improvements brought by a new modality thinner. The most successful recent attempt (caglayan-EtAl:2017:WMT) focused on using the max-pooled features extracted from a CNN to modulate some components of the system (e.g. the target embeddings). So far, researchers have extracted the image features from a pre-trained CNN without any intervention from the encoder-decoder model used for translation.
We decide to take the leap and modulate the feature extraction process with the linguistic input. More precisely, this paper aims to:
Give a first try at a fully end-to-end (visual and textual) multimodal translation model;
Condition the forward pass of the CNN so that visual features are extracted according to the textual encoder;
Propose an encoder-based image attention model, as opposed to the conventional attention mechanism used at decoding time.
In the area of NMT, two works are related to ours, in the sense that the analysis of one modality is affected by the analysis of the other. Firstly, (DBLP:journals/corr/ElliottK17) proposed an architecture with an encoder shared between two decoders: one to output a translated sentence and one to reconstruct ("imagine", as the authors say) the image features. The encoder was thus trained to learn grounded representations. Secondly, (delbrouck2017multimodal) used a grounded attention mechanism (referred to as "pre-attention") where the image features were refined according to the encoder’s representation of the source sentence.
2 Monomodal (Text-based) MT model
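Our model is based on an encoder-decoder architecture with an attention mechanism (BahdanauCB14). The encoder is a bi-directional RNN with Gated Recurrent Unit (GRU) layers (ChungGCB14; cho-al-emnlp14). A forward RNN and a backward RNN both read an input sequence $X = (x_1, \ldots, x_M)$, ordered from $x_1$ to $x_M$ and from $x_M$ to $x_1$ respectively. Each RNN produces a hidden state ($\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ respectively) for each word $x_i$. We create a sequence of annotations $A = (a_1, \ldots, a_M)$ with $a_i = [\overrightarrow{h}_i ; \overleftarrow{h}_i]$, where $[\,;\,]$ denotes the concatenation operation. Therefore, each annotation $a_i$ now contains the summaries of both the preceding words and the following words.
The decoder is a CGRU (two stacked GRUs) that predicts the probability of a target sequence $Y = (y_1, \ldots, y_N)$ based on $A$. At each decoding step $t$, an unnormalized attention score $e_{t,i}$ is computed for each source annotation $a_i$ using the first GRU’s hidden state $s'_t$ and $a_i$ itself (equation 1):
$$e_{t,i} = v_a^\top \tanh(W_a s'_t + U_a a_i) \tag{1}$$
$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{M} \exp(e_{t,j})} \tag{2}$$
$$c_t = \sum_{i=1}^{M} \alpha_{t,i}\, a_i \tag{3}$$
The attention vector $c_t$ is calculated as a weighted average of the source states, as shown in equation 3. The second GRU computes the final state $s_t$ of the decoder from $c_t$ and $s'_t$. The decoder outputs a distribution over a vocabulary of fixed size V based on the recurrent state of the second GRU $s_t$, the previously emitted word $y_{t-1}$, and the attention vector $c_t$:
$$p(y_t \mid y_{<t}, A) = \mathrm{softmax}\big(W_o \tanh(L_s s_t + L_y E[y_{t-1}] + L_c c_t)\big) \tag{4}$$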
The whole model is trained end-to-end by minimizing the negative log likelihood of the target words using stochastic gradient descent.
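The attention step described above (scoring, softmax normalization, weighted average) can be sketched in a few lines of NumPy. This is a minimal illustration, assuming an additive (Bahdanau-style) scoring function; the parameter names `W_a`, `U_a`, `v_a` and all sizes are hypothetical and are not taken from the released code.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def attention_step(annotations, s_prime, W_a, U_a, v_a):
    """One decoding step of additive attention.

    annotations: (M, d_a) source annotations, one row per source word
    s_prime:     (d_s,)   hidden state of the first decoder GRU
    Returns the attention weights (M,) and the context vector (d_a,).
    """
    # Unnormalized score for every source annotation.
    scores = np.tanh(s_prime @ W_a + annotations @ U_a) @ v_a
    # Normalize the scores into a distribution over source positions.
    alpha = softmax(scores)
    # Context vector: weighted average of the annotations.
    context = alpha @ annotations
    return alpha, context

rng = np.random.default_rng(0)
M, d_a, d_s, d_att = 5, 8, 6, 4          # illustrative sizes only
A = rng.normal(size=(M, d_a))
s_prime = rng.normal(size=d_s)
W_a = rng.normal(size=(d_s, d_att))      # hypothetical parameter names
U_a = rng.normal(size=(d_a, d_att))
v_a = rng.normal(size=d_att)
alpha, c_t = attention_step(A, s_prime, W_a, U_a, v_a)
```

The weights `alpha` sum to one, so `c_t` stays in the convex hull of the source annotations.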
3 Our multimodal MT model
As stated in the introduction, the convolutional network extracting the image features is now part of the training procedure. We chose a residual network (ResNet), which iteratively refines a representation by adding pass-through routing, so that a layer receives more detailed information than what has been processed by the layer immediately preceding it. This modification makes it possible to train deep convolutional networks without suffering too much from the vanishing gradient problem.
3.1 Residual Network
ResNets are built from residual blocks:
$$y = F(x, \{W_i\}) + W_s x \tag{5}$$
Here, $x$ and $y$ are the input and output vectors of the layers considered. The function $F(x, \{W_i\})$ is the residual mapping to be learned. For example, if we consider two layers, $F = W_2\, \sigma(W_1 x)$, where $\sigma$ denotes the ReLU function. The operation $F + x$ is the shortcut connection and consists of an element-wise addition. Therefore, the dimensions of $x$ and $F$ must be equal, in which case $W_s$ is the identity. When this is not the case (e.g., when changing the input/output channels), the matrix $W_s$ performs a linear projection on the shortcut connection to match the dimensions. Finally, the block applies a second nonlinearity after the addition (i.e., $\sigma(y)$). A group of blocks is stacked to form a stage of computation. The general ResNet architecture starts with a single convolutional layer followed by 4 stages of computation.
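The residual block described above can be illustrated with a short sketch. It assumes fully connected layers on a feature vector for brevity (real ResNets use convolutions and batch normalization inside the block); the function and argument names are chosen for illustration only.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2, W_s=None):
    """Two-layer residual block on a feature vector.

    Computes relu(F(x) + shortcut(x)) with F(x) = W2 @ relu(W1 @ x).
    W_s is only needed when input and output dimensions differ.
    """
    residual = W2 @ relu(W1 @ x)
    shortcut = x if W_s is None else W_s @ x
    if shortcut.shape != residual.shape:
        raise ValueError("shortcut and residual must have the same shape")
    # Element-wise addition followed by the second nonlinearity.
    return relu(residual + shortcut)

rng = np.random.default_rng(0)
x = rng.normal(size=4)
# Same input/output size: identity shortcut.
y_same = residual_block(x, rng.normal(size=(4, 4)), rng.normal(size=(4, 4)))
# Output size 6: the shortcut needs a linear projection W_s.
y_proj = residual_block(x, rng.normal(size=(5, 4)), rng.normal(size=(6, 5)),
                        W_s=rng.normal(size=(6, 4)))
```

When the residual weights are zero the block reduces to `relu(x)`, which is why stacking such blocks does not make the identity mapping harder to represent.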
3.2 Conditional Batch Normalization
A ResNet adopts batch normalization (BN) (ICML-2015-IoffeS) right after each convolution and before activation. This technique tackles the problem of internal covariate shift (the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change) by normalizing layer inputs:
$$\hat{x} = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} \tag{6}$$
$$y = \gamma \hat{x} + \beta \tag{7}$$
The network applies equation 6 so that each feature dimension of the input $x$ has zero mean and unit variance over the whole mini-batch. On top of that, the model has the opportunity to scale and shift the result with the learned parameters $\gamma$ and $\beta$, as shown in equation 7, before going through the non-linearity (ReLU). At inference time, the batch mean and variance are replaced by a single empirical mean and variance of the activations estimated during training.
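As a rough illustration of the difference between training and inference, the sketch below normalizes a (batch, features) array with batch statistics during training and with running averages at inference. The `momentum` value and the class name are assumptions made for this example.

```python
import numpy as np

class BatchNorm:
    """Batch normalization over the first axis of a (batch, features) array."""

    def __init__(self, num_features, eps=1e-5, momentum=0.9):
        self.gamma = np.ones(num_features)    # learned scale
        self.beta = np.zeros(num_features)    # learned shift
        self.eps = eps
        self.momentum = momentum              # assumed value, for illustration
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)

    def __call__(self, x, training):
        if training:
            mean, var = x.mean(axis=0), x.var(axis=0)
            # Keep moving averages of the statistics for inference.
            m = self.momentum
            self.running_mean = m * self.running_mean + (1 - m) * mean
            self.running_var = m * self.running_var + (1 - m) * var
        else:
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)   # normalize
        return self.gamma * x_hat + self.beta          # scale and shift

rng = np.random.default_rng(0)
bn = BatchNorm(num_features=3)
batch = rng.normal(loc=3.0, scale=2.0, size=(64, 3))
train_out = bn(batch, training=True)
test_out = bn(batch[:4], training=False)
```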
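To modulate the visual processing by language, we predict a small change in the shift and scale parameters of equation 7 according to the text-based source annotation sequence $A$, as has already been proposed for the related VQA task (modul) (called "Modulated ResNet" in the authors’ paper). We call this conditional batch normalization (CBN). To do so, we use a one-hidden-layer MLP to predict these deltas for all feature maps within the layer:
$$\Delta\beta = \mathrm{MLP}(A), \qquad \Delta\gamma = \mathrm{MLP}(A)$$
$$\hat{\beta} = \beta + \Delta\beta, \qquad \hat{\gamma} = \gamma + \Delta\gamma$$
The updated parameters $\hat{\gamma}$ and $\hat{\beta}$ then replace $\gamma$ and $\beta$ in equation 7.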
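A minimal sketch of conditional batch normalization follows. It assumes that the source annotations of each sentence are summarized by their mean before being fed to the MLP (the text does not specify the summary), and that feature maps are stored as (batch, height, width, channels). All names and sizes are hypothetical.

```python
import numpy as np

def predict_deltas(text_vec, W1, b1, W2, b2):
    """One-hidden-layer MLP mapping text vectors (N, d) to per-channel deltas."""
    hidden = np.maximum(text_vec @ W1 + b1, 0.0)      # ReLU hidden layer
    out = hidden @ W2 + b2                             # (N, 2 * C)
    d_gamma, d_beta = np.split(out, 2, axis=-1)        # two (N, C) arrays
    return d_gamma, d_beta

def conditional_batch_norm(x, gamma, beta, d_gamma, d_beta, eps=1e-5):
    """x: (N, H, W, C) feature maps; deltas: (N, C), one row per sentence."""
    mean = x.mean(axis=(0, 1, 2))
    var = x.var(axis=(0, 1, 2))
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Frozen BN parameters plus the text-predicted change, per example.
    gamma_hat = (gamma + d_gamma)[:, None, None, :]
    beta_hat = (beta + d_beta)[:, None, None, :]
    return gamma_hat * x_hat + beta_hat

rng = np.random.default_rng(0)
N, H, W, C, d_text, d_hidden = 2, 7, 7, 4, 6, 5        # illustrative sizes
feature_maps = rng.normal(size=(N, H, W, C))
annotations = rng.normal(size=(N, 3, d_text))          # 3 source words each
text_vec = annotations.mean(axis=1)                    # assumption: mean summary
W1, b1 = rng.normal(size=(d_text, d_hidden)), np.zeros(d_hidden)
W2, b2 = 0.01 * rng.normal(size=(d_hidden, 2 * C)), np.zeros(2 * C)
gamma, beta = np.ones(C), np.zeros(C)
d_gamma, d_beta = predict_deltas(text_vec, W1, b1, W2, b2)
out = conditional_batch_norm(feature_maps, gamma, beta, d_gamma, d_beta)
```

With all deltas equal to zero the function reduces to standard batch normalization, so a pre-trained network can be left unchanged at the start of training if the MLP output layer is initialized near zero.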
3.3 Image Features
This section explains which image features are extracted from the ResNet and how the model described in section 2 uses them. Commonly, two types of features are used for machine translation: global pooled features (a single feature vector) and convolutional features (also called an attention map, a 3D matrix). Because we use one or the other, our model has two variants, referred to as "pool5" and "conv" in the results section.
3.3.1 Global pool5 Features
In the ResNet architecture, at the end of the 4th stage sits a max-pooling layer, just before the fully connected layer, whose output is a global 2048-dimensional visual representation $v_{pool}$ of the image. We use $v_{pool}$ to modulate each source annotation $a_i$ using element-wise multiplication (as done in (caglayan-EtAl:2017:WMT)):
$$a_i \leftarrow a_i \odot \tanh(W_{img}\, v_{pool})$$
where $W_{img}$ projects the pooled vector to the size of the annotations. Because $v_{pool}$ is a single feature vector, a "pool5" model does not need a second attention mechanism.
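The modulation of the source annotations by the pooled image vector can be sketched as follows. The example assumes that the 2048-dimensional vector is first projected to the annotation size through a tanh layer; the projection matrix `W_img` and the sizes are hypothetical.

```python
import numpy as np

def modulate_annotations(annotations, pool5, W_img):
    """Scale every source annotation by a projected image vector.

    annotations: (M, d_a) source annotations
    pool5:       (2048,)  global pooled image features
    W_img:       (2048, d_a) projection (assumed for this sketch)
    """
    visual = np.tanh(pool5 @ W_img)      # (d_a,)
    return annotations * visual          # broadcast over the M source words

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 8))
pool5 = rng.normal(size=2048)
W_img = 0.01 * rng.normal(size=(2048, 8))
modulated = modulate_annotations(A, pool5, W_img)
```

Since the same vector multiplies every annotation, the image acts as a global gate on the textual representation rather than as something the decoder attends to.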
3.3.2 Convolutional Features
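At the end of the 3rd ResNet stage, after the ReLU activation (res4f), we extract convolutional feature maps of size 7x7x1024 (the 3D matrix), which are regarded as 49 spatial annotations $v_j$ of 1024 dimensions each. We use a soft attention mechanism over the 49 visual spatial locations at each decoding step $t$. It is exactly the same mechanism as in section 2 but with the source annotation $a_i$ replaced by $v_j$:
$$e^{img}_{t,j} = {v_a^{img}}^{\top} \tanh(W_a^{img} s'_t + U_a^{img} v_j) \tag{12}$$
where $s'_t$ is the hidden state of the first decoder GRU. The scores are normalized with a softmax and used to compute a weighted average of the $v_j$, as in section 2.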
4 Encoder-based image attention
In machine translation, any image attention mechanism on convolutional features – soft (BahdanauCB14), local or stochastic (delbrouck-dupont:2017:EMNLP2017) – happens on the decoder side (it is based on the decoder state $s'_t$, as seen in section 3.3.2). At each time-step $t$, the decoder has to decide which spatial features are relevant to decode the next translated token. However, when translating a sentence in real life, we tend to imagine a visual representation as soon as we read the source sentence. The encoder is therefore probably the best place to build a strong visual representation for our translation task. This hypothesis is reinforced by the additional role the encoder now takes on: modulating the visual processing as described in section 3.2. Because the encoder now plays a part in the making of these convolutional features, we propose to apply the attention mechanism for the visual representation during encoding.
As shown in Figure 1, the encoder now builds, at every time-step $i$, a textual representation $a_i$ and a visual representation $a^{img}_i$ of the word $x_i$. The encoder visual attention module is similar to the one described in subsection 3.3.2, but takes place in the encoder. Therefore, equation 12 no longer depends on the decoder state $s'_t$ but on the source annotation $a_i$. The decoder, at every decoding step $t$, still computes a soft alignment over the source sentence annotations and hence obtains the visual representations along the way. In contrast to earlier works on multimodal MT, the decoder is now equipped with only one multimodal attention mechanism, as the textual and visual representations of a word are both computed in the encoder.
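The encoder-side visual attention can be sketched as below: every source annotation queries the spatial image features and receives its own visual summary. This is an illustrative sketch only; the parameter names are hypothetical, the feature size is reduced from 1024 to keep the example small, and how the textual and visual representations are combined afterwards is not specified here.

```python
import numpy as np

def encoder_visual_attention(annotations, spatial, W_a, U_v, v_a):
    """Attend over image locations once per source word.

    annotations: (M, d_a) source annotations (the queries)
    spatial:     (L, d_v) spatial image features, e.g. L = 49
    Returns visual representations (M, d_v) and attention weights (M, L).
    """
    query = annotations @ W_a                         # (M, d_att)
    keys = spatial @ U_v                              # (L, d_att)
    scores = np.tanh(query[:, None, :] + keys[None, :, :]) @ v_a   # (M, L)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)     # softmax over locations
    visual = weights @ spatial                        # (M, d_v)
    return visual, weights

rng = np.random.default_rng(0)
M, L, d_a, d_v, d_att = 5, 49, 8, 16, 4               # illustrative sizes
A = rng.normal(size=(M, d_a))
V = rng.normal(size=(L, d_v))
visual, weights = encoder_visual_attention(
    A, V,
    rng.normal(size=(d_a, d_att)),
    rng.normal(size=(d_v, d_att)),
    rng.normal(size=d_att),
)
```

Because the query is the source annotation rather than a decoder state, the visual representations are computed once per sentence instead of once per decoding step.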
5 Dataset
We used the Multi30K dataset (elliott-EtAl:2016:VL16), which is an extended version of the Flickr30K Entities dataset. For each image, one of the English descriptions was selected and manually translated into German by a professional translator. As training and development data, 29,000 and 1,014 triples are used respectively. We have three test sets to score our models. The Flickr Test2016 and Flickr Test2017 sets each contain 1,000 image-caption pairs, and the ambiguous MSCOCO test set (ElliottFrankBarraultBougaresSpecia2017) contains 461 pairs.
6 Experiments and Results
Previous work (delbrouck2017multimodal) showed that using visual features from different CNNs leads to variable translation performance for the same encoder-decoder model. In this paper, we stick to two different versions of ResNet: the ResNet v1 detailed in section 3.1 and ResNet v2, a slight variant of ResNet v1 described in (He2016). The image preprocessing operations are described in Appendix A.
|Model||Test 2016 BLEU||Test 2016 METEOR||Test 2017 BLEU||Test 2017 METEOR||Ambiguous COCO BLEU||Ambiguous COCO METEOR|
|Pre-trained Pool5* (caglayan-EtAl:2017:WMT)||38.4 ± 0.3||57.8 ± 0.5||31.1 ± 0.7||51.9 ± 0.2||27.0 ± 0.7||47.1 ± 0.7|
|RN v1 CBN Conv||38.9 ± 0.3||57.1 ± 0.6||30.0 ± 1.1||50.9 ± 0.2||26.3 ± 0.9||46.5 ± 0.6|
|RN v1 CBN Pool5||39.4 ± 0.8||57.9 ± 0.6||31.5 ± 0.4||52.2 ± 0.5||27.4 ± 0.9||48.1 ± 0.6|
|RN v2 CBN Pool5||38.7 ± 0.3||56.5 ± 0.5||30.1 ± 0.7||51.1 ± 0.6||26.5 ± 0.7||46.3 ± 0.6|
|RN v1 CBN FT Pool5||38.2 ± 0.6||57.5 ± 0.6||30.4 ± 0.6||51.4 ± 0.7||26.4 ± 0.9||46.8 ± 0.4|
|RN v1 CBN enc-att||40.5 ± 0.8||57.9 ± 0.6||31.4 ± 0.4||52.5 ± 0.7||27.3 ± 0.9||48.5 ± 0.4|
Both ResNets are pretrained on ImageNet. Unless stated otherwise, ResNet parameters are frozen during training, including the scalars $\gamma$ and $\beta$ from section 3.2. We use the BLEU (Papineni:2002) and METEOR (meteor-wmt:2014) metrics to evaluate the quality of our models’ translations. We stop training if there is no METEOR improvement on the dev set for 10k steps.
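The stopping criterion can be expressed as a small helper. This is a sketch under the assumption that METEOR is evaluated on the dev set at regular intervals and logged as (step, score) pairs; the function name and the logging format are hypothetical.

```python
def should_stop(history, patience_steps=10_000):
    """Return True if the best dev METEOR is at least `patience_steps` old.

    history: list of (training_step, dev_meteor) pairs in increasing step order.
    """
    if not history:
        return False
    best_step, best_score = history[0]
    for step, score in history[1:]:
        if score > best_score:           # strict improvement resets the clock
            best_step, best_score = step, score
    last_step = history[-1][0]
    return last_step - best_step >= patience_steps

# Best score reached at step 2000, no improvement 10k steps later.
stalled = should_stop([(1000, 50.0), (2000, 51.0), (12000, 50.5)])
# Still within the patience window.
running = should_stop([(1000, 50.0), (2000, 51.0), (11000, 50.5)])
```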
First and foremost, we notice that conditional batch normalization improves our model’s translations, specifically when using the global features (RN v1 CBN Pool5). Applying CBN at every ResNet stage leads to the best improvement (cf. Table 4 in Appendix C), but we also find that fine-tuning the last layer does not improve this performance (RN v1 CBN FT Pool5). This result reinforces our main postulate that modulating the visual process by language enhances the quality of the translations.
Secondly, when using decoder-based attention on convolutional features as described in section 3.3.2, the model performs poorly (RN v1 CBN Conv). As stated in the introduction, it is unclear whether we have enough data to successfully train an attention model. Nevertheless, using the encoder-based attention (section 4) closes this gap: RN v1 CBN enc-att recovers the scores lost by RN v1 CBN Conv and obtains results very close to those of RN v1 CBN Pool5.
Lastly, using a ResNet v2 slightly deteriorates the results. The key difference between the two architectures is the use of batch normalization before every weight layer in ResNet v2. A more in-depth study of the model parameters and architecture might be needed to figure out the cause of this small drop. Another possible direction for future work would be to use larger images as ResNet inputs (448x448) to obtain convolutional features with 196 spatial locations, as this has shown great success in VQA.
This work was partly supported by the Chist-Era project IGLU with contribution from the Belgian Fonds de la Recherche Scientifique (FNRS), contract no. R.50.11.15.F, and by the FSO project VCYCLE with contribution from the Belgian Walloon Region, contract no. 1510501.
Appendix A ResNets
ResNet v1 has been trained on ImageNet with VGG preprocessing, which consists of a random crop and a random flip. ResNet v2 applies the Inception preprocessing, which uses, on top of the VGG preprocessing, random color distortion (hue, contrast, brightness and saturation). The input image sizes for ResNet v1 and v2 are 224x224x3 and 299x299x3 respectively.
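The two augmentation steps named above (random crop and random horizontal flip) can be sketched in NumPy. This is only an illustration of those two operations; resizing, mean subtraction and the color distortions are deliberately left out, and the function name is hypothetical.

```python
import numpy as np

def random_crop_and_flip(image, size, rng):
    """Take a random size x size crop, then flip it horizontally half the time.

    image: (H, W, 3) array with H >= size and W >= size
    """
    height, width, _ = image.shape
    if height < size or width < size:
        raise ValueError("image is smaller than the requested crop")
    top = rng.integers(0, height - size + 1)
    left = rng.integers(0, width - size + 1)
    crop = image[top:top + size, left:left + size, :]
    if rng.random() < 0.5:
        crop = crop[:, ::-1, :]          # horizontal flip
    return crop

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(256, 320, 3))
crop = random_crop_and_flip(image, 224, rng)
```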
|ResNet v1 version||ResNet-50|
|ResNet v2 version||ResNet-50|
|ResNet v1 input size||224x224x3|
|ResNet v2 input size||299x299x3|
|ResNet v1 CBN inference||moving average|
|ResNet v2 CBN inference||exponential moving average|
|CBN damping factor||1e-5|
|CBN MLP hidden units||512|
|Blocks with CBN||all (by default)|
Appendix B Sequence to sequence model
To conduct our experiments, we use the TensorFlow [abadi2016tensorflow] library as well as the Google seq2seq framework [Britz:2017]. We release our code on GitHub (https://github.com/jbdel/mmt_cbn). We normalize and tokenize English and German descriptions using the Moses tokenizer scripts [Koehn:2007]. We use the byte pair encoding algorithm on the training set to convert space-separated tokens into subwords [sennrich2016subword] with 10K merge operations, reducing our vocabulary size to 5234 and 7052 words for English and German respectively. Embeddings are learned along with the model.
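To make the byte pair encoding step concrete, here is a small sketch of how merge operations are learned from word frequencies. It is a toy re-implementation for illustration, not the script used in the experiments; the helper names and the example word counts are assumptions.

```python
import collections

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` by a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_freqs, num_merges):
    """Learn `num_merges` merge operations from a {word: count} dictionary."""
    # Words start as character sequences with an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent pair
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab

merges, vocab = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3)
```

With a larger number of merges (10K in our setting), frequent words end up as single symbols while rare words stay split into subword units.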
Both encoders are equipped with layer normalization [ba2016layer] where each hidden unit adaptively normalizes its incoming activations with a learnable gain and bias.
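Layer normalization can be sketched as follows. Unlike batch normalization, the statistics are computed over the hidden units of each example rather than over the mini-batch. The function name and the `eps` value are assumptions made for this example.

```python
import numpy as np

def layer_norm(activations, gain, bias, eps=1e-5):
    """Normalize each row over its hidden units, then apply gain and bias.

    activations: (batch, hidden) incoming activations
    gain, bias:  (hidden,) learnable parameters
    """
    mean = activations.mean(axis=-1, keepdims=True)
    std = np.sqrt(activations.var(axis=-1, keepdims=True) + eps)
    return gain * (activations - mean) / std + bias

rng = np.random.default_rng(0)
hidden = rng.normal(loc=2.0, scale=3.0, size=(4, 16))
normalized = layer_norm(hidden, gain=np.ones(16), bias=np.zeros(16))
```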
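We initialize the decoder hidden state $s_0$ of the CGRU with a non-linear transformation of the average source annotation:
$$s_0 = \tanh\Big(W_{init}\, \frac{1}{M} \sum_{i=1}^{M} a_i\Big)$$
where $a_1, \ldots, a_M$ are the source annotations.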
The decoder is a conditional GRU (https://github.com/nyu-dl/dl4mt-tutorial/blob/master/docs/cgru.pdf) that consists of two stacked GRU activations called REC1 and REC2 and an attention mechanism in between (called ATT in the linked document). At each time-step $t$, REC1 first computes a hidden state proposal $s'_t$ based on the previous hidden state $s_{t-1}$ and the previously emitted word $y_{t-1}$:
$$s'_t = \mathrm{REC1}(y_{t-1}, s_{t-1})$$
Then, the attention mechanism computes the context vector $c_t = \mathrm{ATT}(A, s'_t)$ over the source sentence using the annotation sequence $A$ and the intermediate hidden state proposal $s'_t$ (cf. section 2).
Finally, the second recurrent cell, REC2, computes the hidden state $s_t$ of the CGRU by looking at the intermediate representation $s'_t$ and the context vector $c_t$:
$$s_t = \mathrm{REC2}(s'_t, c_t)$$
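The three stages of a conditional GRU step can be put together in a compact sketch. It assumes standard GRU gates and additive attention, uses random parameters and small sizes for illustration, and omits layer normalization and dropout; all parameter names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_gru(rng, d_in, d_h, scale=0.1):
    """Random GRU parameters (illustrative initialization)."""
    shapes = {"W_z": (d_in, d_h), "U_z": (d_h, d_h),
              "W_r": (d_in, d_h), "U_r": (d_h, d_h),
              "W_h": (d_in, d_h), "U_h": (d_h, d_h)}
    return {name: scale * rng.normal(size=shape) for name, shape in shapes.items()}

def gru_cell(x, h, p):
    z = sigmoid(x @ p["W_z"] + h @ p["U_z"])               # update gate
    r = sigmoid(x @ p["W_r"] + h @ p["U_r"])               # reset gate
    h_tilde = np.tanh(x @ p["W_h"] + r * (h @ p["U_h"]))   # candidate state
    return (1.0 - z) * h_tilde + z * h

def attend(annotations, s_prime, p):
    scores = np.tanh(s_prime @ p["W_a"] + annotations @ p["U_a"]) @ p["v_a"]
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ annotations                           # context vector

def cgru_step(y_prev_emb, s_prev, annotations, params):
    s_prime = gru_cell(y_prev_emb, s_prev, params["rec1"])   # first GRU
    c_t = attend(annotations, s_prime, params["att"])        # attention
    s_t = gru_cell(c_t, s_prime, params["rec2"])             # second GRU
    return s_t, c_t

rng = np.random.default_rng(0)
d_emb, d_h, d_a, d_att, M = 4, 6, 8, 5, 3                   # illustrative sizes
params = {
    "rec1": init_gru(rng, d_emb, d_h),
    "rec2": init_gru(rng, d_a, d_h),
    "att": {"W_a": rng.normal(size=(d_h, d_att)),
            "U_a": rng.normal(size=(d_a, d_att)),
            "v_a": rng.normal(size=d_att)},
}
A = rng.normal(size=(M, d_a))
W_init = 0.1 * rng.normal(size=(d_a, d_h))
s_0 = np.tanh(A.mean(axis=0) @ W_init)      # initial state from the mean annotation
y_prev = rng.normal(size=d_emb)             # embedding of the previous word
s_1, c_1 = cgru_step(y_prev, s_0, A, params)
```

The first GRU only sees the previous word and state, so the attention query already reflects what has just been emitted before the source is consulted.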
|Source and target embeddings||128|
|GRU and CGRU Layer size||256|
|GRU input dropout||0.7|
|GRU output dropout||0.5|
|CGRU input dropout||1.0|
|CGRU output dropout||1.0|
|Softmax output dropout 4||0.5|
Appendix C Further Results
|RN v1 CBN Pool5: stages with CBN||Test 2016 BLEU||Test 2016 METEOR||Test 2017 BLEU||Test 2017 METEOR||Ambiguous COCO BLEU||Ambiguous COCO METEOR|
|All||39.4 ± 0.8||57.9 ± 0.6||31.5 ± 0.4||52.2 ± 0.5||27.4 ± 0.9||48.1 ± 0.6|
|Stages 2 - 4||38.7 ± 0.6||57.0 ± 0.7||31.4 ± 0.9||51.4 ± 0.4||27.3 ± 0.9||46.8 ± 0.9|
|Stages 3 - 4||38.8 ± 0.8||56.1 ± 0.7||30.4 ± 0.8||51.1 ± 0.6||25.9 ± 1.0||46.0 ± 0.6|