LIUM-CVC Submissions for WMT17 Multimodal Translation Task

This paper describes the monomodal and multimodal Neural Machine Translation systems developed by LIUM and CVC for WMT17 Shared Task on Multimodal Translation. We mainly explored two multimodal architectures where either global visual features or convolutional feature maps are integrated in order to benefit from visual context. Our final systems ranked first for both En-De and En-Fr language pairs according to the automatic evaluation metrics METEOR and BLEU.





1 Introduction

With the recent advances in deep learning, purely neural approaches to machine translation, such as Neural Machine Translation (NMT) (Sutskever et al., 2014; Bahdanau et al., 2014), have received a lot of attention because of their competitive performance (Toral and Sánchez-Cartagena, 2017). Another reason for the popularity of NMT is its flexible nature, which allows researchers to fuse auxiliary information sources in order to design sophisticated networks such as multi-task, multi-way and multi-lingual systems, to name a few (Luong et al., 2015; Johnson et al., 2016; Firat et al., 2017).

Multimodal Machine Translation (MMT) aims to achieve better translation performance by visually grounding the textual representations. Recently, a new shared task on Multimodal Machine Translation and Crosslingual Image Captioning (CIC) was proposed along with WMT16 (Specia et al., 2016). In this paper, we present MMT systems jointly designed by LIUM and CVC for the second edition of this task within WMT17.

Last year we proposed a multimodal attention mechanism where two different attention distributions were estimated over textual and image representations using shared transformations (Caglayan et al., 2016a). More specifically, convolutional feature maps extracted from a ResNet-50 CNN (He et al., 2016) pre-trained on the ImageNet classification task (Russakovsky et al., 2015) were used to represent visual information. Although our submission ranked first among multimodal systems for the CIC task, it was not able to improve over purely textual NMT baselines in either task (Specia et al., 2016). The winning submission for MMT (Caglayan et al., 2016a) was a phrase-based MT system rescored using a language model enriched with FC global visual features extracted from a pre-trained VGG-19 CNN (Simonyan and Zisserman, 2014).

State-of-the-art results were obtained after WMT16 by using a separate attention mechanism for different modalities in the context of CIC (Caglayan et al., 2016b) and MMT (Calixto et al., 2017a). Besides experimenting with multimodal attention, Calixto et al. (2017a) and Libovický and Helcl (2017) also proposed a gating extension inspired by Xu et al. (2015), believed to allow the decoder to learn when to attend to a particular modality, although Libovický and Helcl (2017) report no improvement over the baseline NMT.

There have also been attempts to benefit from different types of visual information instead of relying on features extracted from a CNN pre-trained on ImageNet. One such study from Huang et al. (2016) extended the sequence of source embeddings consumed by the RNN with several regional features extracted from a region-proposal network (Ren et al., 2015). The architecture thus predicts a single attention distribution over a sequence of mixed-modality representations leading to significant improvement over their NMT baseline.

More recently, a radically different multi-task architecture called Imagination (Elliott and Kádár, 2017) was proposed to learn visually grounded representations by sharing an encoder between two tasks: a classical encoder-decoder NMT and visual feature reconstruction from the source sentence representation.

This year, we experiment with both convolutional and global visual vectors provided by the organizers to better exploit multimodality (Section 3). (A detailed tutorial for reproducing the results of this paper is available online.) Data preprocessing for both English→{German, French} directions and the training hyper-parameters are detailed in Section 2 and Section 4 respectively. The results based on automatic evaluation metrics are reported in Section 5. The paper ends with a discussion in Section 6.

2 Data

We use the Multi30k (Elliott et al., 2016) dataset provided by the organizers, which contains 29000, 1014 and 1000 English→{German, French} image-caption pairs for the training, validation and Test2016 (the official evaluation set of the WMT16 campaign) sets respectively. Following the task rules, we normalized punctuation and applied tokenization and lowercasing. A Byte Pair Encoding (BPE) model (Sennrich et al., 2016) with 10K merge operations is learned for each language pair, resulting in vocabularies of 5234 and 7052 tokens for English–German and 5945 and 6547 tokens for English–French respectively.
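The BPE merge-learning step above can be illustrated with a minimal pure-Python sketch. This is the general idea only, not the implementation used for the systems; the toy word-frequency corpus and merge count are illustrative values:

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merge operations from a word -> frequency dict."""
    # Represent each word as a tuple of symbols, with an end-of-word marker.
    vocab = {tuple(w) + ("</w>",): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the best merge everywhere it occurs.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = learn_bpe({"low": 5, "lower": 2, "lowest": 2}, num_merges=3)
```

After three merges, the frequent word "low" collapses into a single subword unit while the rarer suffixes remain split, which is exactly what keeps the vocabulary small.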

We report results on the Flickr Test2017 set containing 1000 image-caption pairs and on the optional MSCOCO test set of 461 image-caption pairs, which is considered an out-of-domain set with ambiguous verbs.

Image Features

We experimented with several types of visual representation using deep features extracted from convolutional neural networks (CNN) trained on large visual datasets. Following the current state-of-the-art in visual representation, we used a network with the ResNet-50 architecture (He et al., 2016) trained on the ImageNet dataset (Russakovsky et al., 2015) to extract two types of features: the 2048-dimensional features from the pool5 layer and the 14x14x1024 features from the res4f_relu layer. Note that the former is a global feature while the latter is a feature map with roughly localized spatial information.
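To make the two feature types concrete, the numpy sketch below shows how a 14x14x1024 map flattens into the 196 spatial annotations consumed by the multimodal attention, and how average-pooling away the grid yields a single global vector. (Note that ResNet-50's actual pool5 feature averages a later 7x7x2048 map, so the pooled vector here is an analogy for the global/spatial distinction, not the exact pool5 computation.)

```python
import numpy as np

# Spatial feature map as used by the multimodal attention:
# a 14x14 grid of 1024-dimensional vectors (res4f_relu).
fmap = np.random.rand(14, 14, 1024).astype(np.float32)

# Flattened into 196 "spatial annotations" for the visual attention.
annotations = fmap.reshape(-1, fmap.shape[-1])

# A global feature is obtained by average-pooling away the grid;
# the real pool5 vector is the analogous pooling of the final
# 7x7x2048 map, giving a 2048-dimensional vector.
global_feat = fmap.mean(axis=(0, 1))
```

The spatial map preserves where things are in the image; the pooled vector only preserves what is in it.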

3 Architecture

Our baseline NMT is an attentive encoder-decoder (Bahdanau et al., 2014) variant with a Conditional GRU (CGRU) (Firat and Cho, 2016) decoder.

Let us denote the source and target sequences $S$ and $T$ with respective lengths $N$ and $M$ as follows, where $x_i$ and $y_j$ are embeddings of dimension E:

$S = (x_1, x_2, \dots, x_N), \quad T = (y_1, y_2, \dots, y_M)$

Two GRU (Chung et al., 2014) encoders with R hidden units each process the source sequence in forward and backward directions. Their hidden states are concatenated to form a set of source annotations $A = \{a_1, \dots, a_N\}$ where each element $a_i$ is a vector of dimension $2R$:

$a_i = [\overrightarrow{h}_i ; \overleftarrow{h}_i]$

Both encoders are equipped with layer normalization (Ba et al., 2016) where each hidden unit adaptively normalizes its incoming activations with a learnable gain and bias.
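Layer normalization as described can be sketched in a few lines of numpy (the epsilon constant and the all-ones/all-zeros initialization of the gain and bias shown here are standard choices, not details taken from the paper):

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-5):
    """Normalize activations across the hidden dimension, then
    rescale with a learnable gain and bias (Ba et al., 2016)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return gain * (x - mean) / (std + eps) + bias

R = 256                                   # GRU hidden units
h = np.random.randn(4, 2 * R)             # a batch of concatenated states
g, b = np.ones(2 * R), np.zeros(2 * R)    # learnable parameters at init
out = layer_norm(h, g, b)
```

With the gain at 1 and bias at 0, each row of the output has zero mean and (near-)unit variance; training then adapts `g` and `b` per hidden unit.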


A decoder block, namely CGRU (two stacked GRUs where the hidden state of the first GRU is used for attention computation), is used to estimate a probability distribution over target tokens at each decoding step $t$.

The hidden state $h_0$ of the CGRU is initialized using a non-linear transformation of the average source annotation:

$h_0 = \tanh\left(W_{init} \, \frac{1}{N}\sum_{i=1}^{N} a_i + b_{init}\right) \qquad (1)$


At each decoding timestep $t$, an unnormalized attention score $e_{ti}$ is computed for each source annotation $a_i$ using the first GRU's hidden state $h_t^{(1)}$ and $a_i$ itself ($W_a \in \mathbb{R}^{R \times R}$, $U_a \in \mathbb{R}^{R \times 2R}$ and $v_a \in \mathbb{R}^{R}$):

$e_{ti} = v_a^\top \tanh\left(W_a h_t^{(1)} + U_a a_i\right) \qquad (2)$


The context vector $c_t$ is a weighted sum of the annotations $a_i$ and their respective attention probabilities $\alpha_{ti}$, obtained using a softmax operation over all the unnormalized scores:

$\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{N} \exp(e_{tk})}, \qquad c_t = \sum_{i=1}^{N} \alpha_{ti} \, a_i$

The final hidden state $h_t$ is computed by the second GRU using the context vector $c_t$ and the hidden state of the first GRU $h_t^{(1)}$:

$h_t = \mathrm{GRU}_2\left(h_t^{(1)}, c_t\right)$


The probability distribution over the target tokens is conditioned on the previous token embedding $y_{t-1}$, the hidden state of the decoder $h_t$ and the context vector $c_t$, the latter two transformed with $W_h$ and $W_c$ respectively:

$P(y_t \mid y_{<t}, S) = \mathrm{softmax}\left(W_o \tanh\left(y_{t-1} + W_h h_t + W_c c_t\right)\right)$
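One attention step of the conditional decoder can be sketched in numpy. The symbols and shapes follow the notation reconstructed above (which involves our own symbol choices), with random parameters standing in for trained ones:

```python
import numpy as np

rng = np.random.default_rng(0)
N, R = 12, 256                          # source length, GRU hidden units
A = rng.standard_normal((N, 2 * R))     # source annotations a_i
h1 = rng.standard_normal(R)             # first GRU's hidden state h_t^(1)

# Attention parameters, with the shapes given alongside Equation 2.
W_a = rng.standard_normal((R, R)) * 0.01
U_a = rng.standard_normal((R, 2 * R)) * 0.01
v_a = rng.standard_normal(R) * 0.01

# Equation 2: one unnormalized score per source annotation.
e = np.tanh(h1 @ W_a.T + A @ U_a.T) @ v_a       # shape (N,)

# Softmax over source positions, then the weighted sum c_t.
alpha = np.exp(e - e.max())
alpha /= alpha.sum()
c_t = alpha @ A                                  # context vector, shape (2R,)
```

The second GRU and the output softmax then consume `c_t` as described; they are omitted here to keep the sketch focused on the attention computation.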

3.1 Multimodal NMT

3.1.1 Convolutional Features

The fusion-conv architecture extends the CGRU decoder to a multimodal decoder (Caglayan et al., 2016b) where convolutional feature maps of 14x14x1024 are regarded as 196 spatial annotations $z_j$ of 1024 dimensions each. For each spatial annotation, an unnormalized attention score is computed as in Equation 2, except that the weights and biases are specific to the visual modality and thus not shared with the textual attention:

$e'_{tj} = v'^\top \tanh\left(W' h_t^{(1)} + U' z_j\right)$

The visual context vector $c'_t$ is computed as a weighted sum of the spatial annotations $z_j$ and their respective attention probabilities $\beta_{tj}$:

$c'_t = \sum_{j=1}^{196} \beta_{tj} \, z_j$

The output of the network is now conditioned on a multimodal context vector, the concatenation of the original textual context vector $c_t$ and the newly computed visual context vector $c'_t$:

$c_t^{mm} = [c_t ; c'_t]$
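A corresponding numpy sketch of the fusion-conv visual attention, with its own unshared parameters and the concatenated multimodal context (the shapes follow the 196x1024 feature maps; the random parameters and their scales are placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)
R, D = 256, 1024
Z = rng.standard_normal((196, D))       # 14x14 map as 196 spatial annotations
h1 = rng.standard_normal(R)             # first GRU's hidden state
c_t = rng.standard_normal(2 * R)        # textual context from the text attention

# Visual attention uses its own weights, not shared with the textual one.
W_v = rng.standard_normal((R, R)) * 0.01
U_v = rng.standard_normal((R, D)) * 0.01
v_v = rng.standard_normal(R) * 0.01

e_v = np.tanh(h1 @ W_v.T + Z @ U_v.T) @ v_v     # one score per spatial position
beta = np.exp(e_v - e_v.max())
beta /= beta.sum()
c_v = beta @ Z                                   # visual context vector, shape (D,)

# The decoder output is conditioned on the concatenated multimodal context.
c_mm = np.concatenate([c_t, c_v])                # shape (2R + D,)
```

Keeping the two attention heads separate is what lets each modality develop its own alignment behaviour.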

3.1.2 Global pool5 Features

In this section, we present 5 architectures guided by the global 2048-dimensional visual representation $v$ in different ways. In contrast to the baseline NMT, the decoder's hidden state is initialized with an all-zero vector unless otherwise specified.


dec-init initializes the decoder with a learned transformation of $v$ by replacing Equation 1 with the following:

$h_0 = \tanh\left(W_{init}^{v} v + b_{init}^{v}\right)$

Calixto et al. (2017b) previously explored a similar configuration (IMG) where the decoder is initialized with the sum of global visual features extracted from the FC7 layer of a pre-trained VGG-19 CNN and the last source annotation.


encdec-init initializes both the bi-directional encoder and the decoder with learned transformations of $v$, where $h_0^{enc}$ represents the initial state of the encoders (note that in the baseline NMT, $h_0^{enc}$ is an all-zero vector):

$h_0^{enc} = \tanh\left(W_{enc}^{v} v + b_{enc}^{v}\right), \qquad h_0 = \tanh\left(W_{init}^{v} v + b_{init}^{v}\right)$


ctx-mul modulates each source annotation $a_i$ with a learned projection of $v$ to dimension $2R$ using element-wise multiplication:

$a_i \leftarrow a_i \odot W_{ctx} v$


trg-mul modulates each target embedding $y_j$ with a learned projection of $v$ to dimension E using element-wise multiplication:

$y_j \leftarrow y_j \odot W_{trg} v$


dec-init-ctx-trg-mul combines the latter two architectures with dec-init, using a separate transformation layer of $v$ for each of them.
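These integration schemes amount to a few tensor operations each. A numpy sketch of dec-init, ctx-mul and trg-mul under the notation reconstructed above (projection shapes follow the stated dimensions; the random parameters stand in for trained ones, and biases/activations on the modulating projections are simplifications):

```python
import numpy as np

rng = np.random.default_rng(2)
E, R, Dv = 128, 256, 2048
v = rng.standard_normal(Dv)                 # global pool5 feature
A = rng.standard_normal((12, 2 * R))        # source annotations
Y = rng.standard_normal((9, E))             # target embeddings

W_dec = rng.standard_normal((R, Dv)) * 0.01
W_ctx = rng.standard_normal((2 * R, Dv)) * 0.01
W_trg = rng.standard_normal((E, Dv)) * 0.01

# dec-init: the decoder starts from a projection of v (replaces Equation 1).
h0 = np.tanh(W_dec @ v)

# ctx-mul: every source annotation is modulated element-wise by v.
A_mod = A * (W_ctx @ v)

# trg-mul: every target embedding is modulated element-wise by v.
Y_mod = Y * (W_trg @ v)
```

Note how cheap these variants are: each adds only one projection matrix over the baseline, consistent with the small parameter counts reported in Table 1.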

4 Training

We use ADAM (Kingma and Ba, 2014) with a batch size of 32. All weights are initialized using the Xavier method (Glorot and Bengio, 2010) and the total gradient norm is clipped to 5 (Pascanu et al., 2013). Dropout (Srivastava et al., 2014) is enabled after the source embeddings, the source annotations and the pre-softmax activations (different dropout probabilities are used for En→Fr). An L2 regularization term is also applied to avoid overfitting unless otherwise stated. Finally, we set E=128 and R=256 (Section 3) for the embedding and GRU dimensions respectively.
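The total-gradient-norm clipping mentioned above can be sketched as follows (numpy; the helper name is ours, and real toolkits apply this over all parameter gradients before the optimizer update):

```python
import numpy as np

def clip_total_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so that their combined L2
    norm does not exceed max_norm (Pascanu et al., 2013)."""
    total = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    if total > max_norm:
        grads = [g * (max_norm / total) for g in grads]
    return grads, total

# Two toy gradient tensors whose combined norm is sqrt(144) = 12.
grads, norm = clip_total_norm([np.full(100, 1.0), np.full(44, 1.0)])
```

After clipping, the gradients keep their direction but their combined norm is exactly the threshold, which prevents occasional exploding updates from destabilizing training.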

En→De Flickr              | # Params | BLEU T2016       | METEOR T2016     | BLEU T2017       | METEOR T2017
                          |          | (mean±std / Ens.)| (mean±std / Ens.)| (mean±std / Ens.)| (mean±std / Ens.)
Caglayan et al. (2016a)   | 62.0M    | 29.2             | 48.5             | -                | -
Huang et al. (2016)       | -        | 36.5             | 54.1             | -                | -
Calixto et al. (2017a)    | 213M     | 36.5             | 55.0             | -                | -
Calixto et al. (2017b)    | -        | 37.3             | 55.1             | -                | -
Elliott and Kádár (2017)  | -        | 36.8             | 55.8             | -                | -
Baseline NMT              | 4.6M     | 38.1 ±0.8 / 40.7 | 57.3 ±0.5 / 59.2 | 30.8 ±1.0 / 33.2 | 51.6 ±0.5 / 53.8
(D1) fusion-conv          | 6.0M     | 37.0 ±0.8 / 39.9 | 57.0 ±0.3 / 59.1 | 29.8 ±0.9 / 32.7 | 51.2 ±0.3 / 53.4
(D2) dec-init-ctx-trg-mul | 6.3M     | 38.0 ±0.9 / 40.2 | 57.3 ±0.3 / 59.3 | 30.9 ±1.0 / 33.2 | 51.4 ±0.3 / 53.7
(D3) dec-init             | 5.0M     | 38.8 ±0.5 / 41.2 | 57.5 ±0.2 / 59.4 | 31.2 ±0.7 / 33.4 | 51.3 ±0.3 / 53.2
(D4) encdec-init          | 5.0M     | 38.2 ±0.7 / 40.6 | 57.6 ±0.3 / 59.5 | 31.4 ±0.4 / 33.5 | 51.9 ±0.4 / 53.7
(D5) ctx-mul              | 4.6M     | 38.4 ±0.3 / 40.4 | 57.8 ±0.5 / 59.6 | 31.1 ±0.7 / 33.5 | 51.9 ±0.2 / 53.8
(D6) trg-mul              | 4.7M     | 37.8 ±0.9 / 41.0 | 57.7 ±0.5 / 60.4 | 30.7 ±1.0 / 33.4 | 52.2 ±0.4 / 54.0

Table 1: Flickr En→De results. Significance of METEOR differences against the baseline is assessed with the approximate randomization test of multeval over 5 runs. (D6) is the official submission of LIUM-CVC.

All models are implemented and trained with the nmtpy framework (Caglayan et al., 2017) using Theano v0.9 (Theano Development Team, 2016). Each experiment is repeated with 5 different seeds to mitigate the variance of BLEU (Papineni et al., 2002) and METEOR (Lavie and Agarwal, 2007) and to benefit from ensembling. Training is early-stopped if the validation-set METEOR does not improve for 10 consecutive validations, which are performed once every 1000 updates. A beam-search with a beam size of 12 is used for translation decoding.

5 Results

All results are computed using multeval (Clark et al., 2011) with tokenized sentences.
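The approximate randomization test behind these comparisons can be sketched for a simple paired, sentence-level metric (the scores below are toy values; multeval itself resamples the sufficient statistics of BLEU/METEOR rather than a plain mean):

```python
import random

def approximate_randomization(scores_a, scores_b, trials=10000, seed=7):
    """Estimate the p-value that system A's mean advantage over B arose
    by chance, by randomly swapping paired sentence-level scores."""
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    hits = 0
    for _ in range(trials):
        da = db = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:
                a, b = b, a               # swap this sentence pair
            da += a
            db += b
        if abs(da - db) / n >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)      # add-one smoothing on the estimate

# Clearly separated toy scores should yield a small p-value.
a = [0.9, 0.8, 0.85, 0.95, 0.9, 0.88, 0.92, 0.87, 0.91, 0.9]
b = [0.5, 0.4, 0.45, 0.55, 0.5, 0.48, 0.52, 0.47, 0.51, 0.5]
p = approximate_randomization(a, b)
```

The test makes no distributional assumptions, which is why it is preferred over t-tests for corpus-level MT metrics.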

5.1 En→De

Table 1 summarizes the BLEU and METEOR scores obtained by our systems. It should be noted that since we trained each system with 5 different seeds, we report both the results obtained by ensembling the 5 runs and the mean/standard deviation over these 5 runs. The final system to be submitted is selected based on ensemble Test2016 METEOR.

First of all, the multimodal systems which use global pool5 features generally obtain comparable scores that are better than the baseline NMT, in contrast to fusion-conv which fails to improve over it. Our submitted system (D6) achieves an ensemble score of 60.4 METEOR, which is 1.2 points better than the NMT baseline. Although the improvements are smaller, (D6) is still the best system on Test2017 in terms of ensemble/mean METEOR scores. One interesting point to be stressed here is that in terms of mean BLEU, (D6) performs worse than the baseline on both test sets. Similarly, (D3), which has the best BLEU on Test2016, is the worst system on Test2017 according to METEOR. This is clearly a discrepancy between these metrics, where an improvement in one does not necessarily yield an improvement in the other.

En→De MSCOCO              | BLEU (mean±std / Ens.) | METEOR (mean±std / Ens.)
Baseline NMT              | 26.4 ±0.2 / 28.7       | 46.8 ±0.7 / 48.9
(D1) fusion-conv          | 25.1 ±0.7 / 28.0       | 46.0 ±0.6 / 48.0
(D2) dec-init-ctx-trg-mul | 26.3 ±0.9 / 28.8       | 46.5 ±0.4 / 48.5
(D3) dec-init             | 26.8 ±0.5 / 28.8       | 46.5 ±0.6 / 48.4
(D4) encdec-init          | 27.1 ±0.9 / 29.4       | 47.2 ±0.6 / 49.2
(D5) ctx-mul              | 27.0 ±0.7 / 29.3       | 47.1 ±0.7 / 48.7
(D6) trg-mul              | 26.4 ±0.9 / 28.5       | 47.4 ±0.3 / 48.8

Table 2: MSCOCO En→De results: the best Flickr system trg-mul (Table 1) has been used for this submission as well.
En→Fr Flickr        | BLEU T2016       | METEOR T2016     | BLEU T2017       | METEOR T2017
                    | (mean±std / Ens.)| (mean±std / Ens.)| (mean±std / Ens.)| (mean±std / Ens.)
Baseline NMT        | 52.5 ±0.3 / 54.3 | 69.6 ±0.1 / 71.3 | 50.4 ±0.9 / 53.0 | 67.5 ±0.7 / 69.8
(F1) NMT + nol2reg  | 52.6 ±0.8 / 55.3 | 69.6 ±0.6 / 71.7 | 50.0 ±0.9 / 52.5 | 67.6 ±0.7 / 70.0
(F2) fusion-conv    | 53.5 ±0.8 / 56.5 | 70.4 ±0.6 / 72.8 | 51.6 ±0.9 / 55.5 | 68.6 ±0.7 / 71.7
(F3) dec-init       | 54.5 ±0.8 / 56.7 | 71.2 ±0.4 / 73.0 | 52.7 ±0.9 / 55.5 | 69.4 ±0.7 / 71.9
(F4) ctx-mul        | 54.6 ±0.8 / 56.7 | 71.4 ±0.6 / 73.0 | 52.6 ±0.9 / 55.7 | 69.5 ±0.7 / 71.9
(F5) trg-mul        | 54.7 ±0.8 / 56.7 | 71.3 ±0.6 / 73.0 | 52.7 ±0.9 / 55.5 | 69.5 ±0.7 / 71.7
ens-nmt-7           | - / 54.6         | - / 71.6         | - / 53.3         | - / 70.1
ens-mmt-6           | - / 57.4         | - / 73.6         | - / 55.9         | - / 72.2

Table 3: Flickr En→Fr results: scores are averages over 5 runs, given with their standard deviation, together with the score obtained by ensembling the 5 runs. ens-nmt-7 and ens-mmt-6 are the submitted ensembles, which correspond to the combination of 7 monomodal and 6 multimodal (global pool5) systems respectively.

For the MSCOCO set no held-out set for model selection was available. Therefore, we submitted the system (D6) with best METEOR on Flickr Test2016.

After scoring all the available systems (Table 2), we observe that (D4) is the best system according to ensemble metrics. This can be explained by the out-of-domain/ambiguous nature of MSCOCO, where the best generalization performance on Flickr does not necessarily transfer to this set.

Overall, (D4), (D5) and (D6) are the top systems according to METEOR on Flickr and MSCOCO test sets.

5.2 En→Fr

Table 3 shows the results of our systems on the official test sets of last year (Test2016) and this year (Test2017). F1 is a variant of the baseline NMT without L2 regularization. F2 is a multimodal system using convolutional feature maps as visual features, while F3 to F5 are multimodal systems using global pool5 visual features. We note that all multimodal systems perform better than the monomodal ones.

Compared to the MMT 2016 results, the fusion-conv (F2) system with separate attention over both modalities now achieves better performance than the monomodal systems. The results are further improved by systems F3 to F5, which use pool5 global visual features. The particular way of integrating the global visual features does not seem to affect the final results, since F3 to F5 all perform equally well on both test sets.

The submitted systems are presented in the last two lines of Table 3. Since we did not have all 5 runs with different seeds ready by the submission deadline, heterogeneous ensembles of different architectures and different seeds were used instead. ens-nmt-7 (contrastive monomodal submission) and ens-mmt-6 (primary multimodal submission) correspond to ensembles of 7 monomodal and 6 multimodal (pool5) systems respectively. ens-mmt-6 benefits from the heterogeneity of the included systems, resulting in a slight improvement in BLEU and METEOR.

En→Fr MSCOCO        | BLEU (mean±std / Ens.) | METEOR (mean±std / Ens.)
Baseline NMT        | 41.2 ±1.2 / 43.3       | 61.3 ±0.9 / 63.3
(F1) NMT + nol2reg  | 40.6 ±1.2 / 43.5       | 61.1 ±0.9 / 63.7
(F2) fusion-conv    | 43.2 ±1.2 / 45.9       | 63.1 ±0.9 / 65.6
(F3) dec-init       | 43.3 ±1.2 / 46.2       | 63.4 ±0.9 / 66.0
(F4) ctx-mul        | 43.3 ±1.2 / 45.6       | 63.4 ±0.9 / 65.4
(F5) trg-mul        | 43.5 ±1.2 / 45.5       | 63.2 ±0.9 / 65.1
ens-nmt-7           | - / 43.6               | - / 63.4
ens-mmt-6           | - / 45.9               | - / 65.9

Table 4: MSCOCO En→Fr results: ens-mmt-6, the best performing ensemble on the Test2016 corpus (see Table 3), has been used for this submission as well.

Results on the ambiguous dataset extracted from MSCOCO are presented in Table 4. We observe a slightly different behaviour compared to the results in Table 3: the systems using convolutional features perform as well as those using pool5 features. One should note that no specific tuning was performed for this additional task, since no dedicated validation data was provided.

6 Conclusion

We have presented the LIUM-CVC systems for the English→German and English→French Multimodal Machine Translation evaluation campaign. Our systems ranked first for both language pairs in terms of automatic metrics. Using the pool5 global visual features resulted in better performance than the multimodal attention architecture that makes use of convolutional features. This may indicate that the attention mechanism over spatial feature vectors fails to capture useful information from the extracted feature maps. Another explanation is that source sentences already contain most of the information necessary to produce the translation, the visual content being useful only to disambiguate a few specific cases. We also believe that aggressively reducing the number of parameters to around 5M allowed us to avoid overfitting, leading to better scores overall.


Acknowledgements

This work was supported by the French National Research Agency (ANR) through the CHIST-ERA M2CR project, under contract number ANR-15-CHR2-0006-01, and by MINECO through APCIN 2015 under contract number PCIN-2015-251.