UMONS Submission for WMT18 Multimodal Translation Task

by Jean-Benoit Delbrouck, et al.

This paper describes the UMONS solution for the Multimodal Machine Translation Task presented at the third conference on machine translation (WMT18). We explore a novel architecture, called deepGRU, based on recent findings in the related task of Neural Image Captioning (NIC). The models presented in the following sections lead to the best METEOR translation score for both constrained (English, image) -> German and (English, image) -> French sub-tasks.






1 Introduction

In the field of Machine Translation (MT), the efficient integration of multimodal information remains a challenging task. It requires combining diverse modality vector representations with each other. These vector representations, also called context vectors, are computed in order to capture the most relevant information in a modality to output the best translation of a sentence.

To investigate the effectiveness of information obtained from images, a multimodal neural machine translation (MNMT) shared task (Specia et al., 2016) has been introduced to the community. Even though soft attention models had been extensively studied in MNMT (Delbrouck and Dupont, 2017a; Caglayan et al., 2016; Calixto et al., 2017), the most successful recent work (Caglayan et al., 2017a) focused on using the max-pooled features extracted from a convolutional network to modulate some components of the system (i.e. the target embeddings). Convolutional features or attention maps recently showed some success (Delbrouck and Dupont, 2017b) in an encoder-based attention model conditioned on the source encoder representation. Both model types lead to similar results, the latter being slightly more complex and taking longer to train. One feature they share is that the proposed models remain relatively small. Indeed, the number of trainable parameters seems “upper bounded” because the number of unique training examples is limited (cf. Section 4). Heavy or complex attention models on visual features showed premature convergence and restricted scalability.

The model proposed by the University of Mons (UMONS) in 2018 is called DeepGRU, a novel idea based on the previously investigated conditional GRU (cGRU). We enrich the architecture with three ideas borrowed from the closely related NIC task: a third GRU as bottleneck function, a multimodal projection and the use of gated tanh activation. We make sure to keep the overall model light, efficient and fast to train. We start by describing the baseline model in Section 2, followed by the three aforementioned NIC upgrades which make up our DeepGRU model in Section 3. Finally, we present the data made available by the Multimodal Machine Translation Task in Section 4 and the results in Section 5, then engage in a brief discussion in Section 6.

2 Baseline Architecture

Given a source sentence $X = (x_1, \ldots, x_M)$ and an image $I$, an attention-based encoder-decoder model (Bahdanau et al., 2014) outputs the translated sentence $Y = (y_1, \ldots, y_N)$. If we denote $\theta$ as the model parameters, then $\theta$ is learned by maximizing the likelihood of the observed sequence $Y$ or, in other words, by minimizing the cross-entropy loss. The objective function is given by:

$$\mathcal{L}(\theta) = -\sum_{t=1}^{N} \log p_\theta(y_t \mid y_{<t}, X, I)$$
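As a minimal sketch of this objective, the snippet below computes the negative log-likelihood of one target sentence from the probabilities a model assigned to each reference token. The function name `sequence_nll` and the toy probabilities are illustrative assumptions, not part of the submitted system.

```python
import math


def sequence_nll(step_probs):
    """Negative log-likelihood of one target sentence.

    step_probs[t] is the probability the model gave to the reference
    token at decoding step t (hypothetical values for illustration).
    """
    return -sum(math.log(p) for p in step_probs)


# Toy example: a three-token target sentence.
loss = sequence_nll([0.5, 0.25, 0.8])
print(round(loss, 4))
```

Minimizing this quantity over the training set is equivalent to maximizing the likelihood of the reference translations.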


Three main components are involved: an encoder, a decoder and an attention model.

Encoder  At every time-step , an encoder creates an annotation according to the current embedded word and internal state :


Every word of the source sequence is an index in the embedding matrix so that the following formula maps the word to the size :


The total size of the embeddings matrix depends on the source vocabulary size and the embedding dimension such that . The mapping matrix also depends on the embedding dimension because .

The encoder function is a bi-directional GRU Cho et al. (2014). The following equations define a single GRU block (called for future references) :


where . Our encoder consists of two GRUs, one is reading the source sentence from 1 to M and the second from M to 1. The final encoder annotation for timestep becomes the concatenation of both GRUs annotations . Therefore, the encoder set of annotations is of size .
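The bi-directional reading can be sketched as follows. This is a minimal illustration that assumes the standard GRU formulation of Cho et al. (2014); the exact parametrization used in our system may differ, and all names (`GRUCell`, `bidirectional_encode`, the toy sizes) are hypothetical.

```python
import math
import random


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class GRUCell:
    """Standard GRU cell (assumed form, not necessarily the paper's)."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        # One input matrix (W*) and one recurrent matrix (U*) per gate.
        self.Wz, self.Wr, self.Wh = (
            rand_matrix(hidden_size, input_size, rng) for _ in range(3))
        self.Uz, self.Ur, self.Uh = (
            rand_matrix(hidden_size, hidden_size, rng) for _ in range(3))

    def step(self, x, h_prev):
        z = [sigmoid(a + b) for a, b in
             zip(matvec(self.Wz, x), matvec(self.Uz, h_prev))]  # update gate
        r = [sigmoid(a + b) for a, b in
             zip(matvec(self.Wr, x), matvec(self.Ur, h_prev))]  # reset gate
        gated = [ri * hi for ri, hi in zip(r, h_prev)]
        cand = [math.tanh(a + b) for a, b in
                zip(matvec(self.Wh, x), matvec(self.Uh, gated))]  # candidate state
        return [(1 - zi) * ci + zi * hi for zi, ci, hi in zip(z, cand, h_prev)]


def run(cell, inputs):
    """Run a cell over a sequence, starting from a zero state."""
    h = [0.0] * cell.hidden_size
    states = []
    for x in inputs:
        h = cell.step(x, h)
        states.append(h)
    return states


def bidirectional_encode(fwd, bwd, inputs):
    """Concatenate forward (1 to M) and backward (M to 1) annotations."""
    forward = run(fwd, inputs)
    backward = run(bwd, inputs[::-1])[::-1]  # re-align to source positions
    return [f + b for f, b in zip(forward, backward)]


rng = random.Random(0)
d, S, M = 4, 3, 5  # toy sizes, far smaller than the real model
embedded = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(M)]
annotations = bidirectional_encode(GRUCell(d, S, rng), GRUCell(d, S, rng), embedded)
print(len(annotations), len(annotations[0]))
```

With a source sentence of M words and a GRU size of S, the sketch yields M annotations of size 2S, one per source position.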

Decoder  At every time-step, a decoder outputs probabilities over the target vocabulary according to the previously generated word, the internal state and the image:


Every word of the target sequence is an index in the embedding matrix so that the following formula maps the word in the size :


The decoder function is a conditional GRU (cGRU). The following equations describe a cGRU cell:


Here, the visual attention module operates over the set of source annotations and the pooled vector of ResNet-50 features extracted from image I. More precisely, our attention model is the product of the so-called soft attention over the source annotations and a linear transformation of the pooled vector of image I:
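A minimal sketch of this product is given below. It assumes that the product is element-wise, that the attention scores have already been computed (for instance by the usual feed-forward scorer), and that the image projection maps the pooled vector to the annotation size. The names `soft_attention`, `multimodal_context` and `W_img` are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def soft_attention(annotations, scores):
    """Weighted sum of source annotations; `scores` are unnormalized."""
    weights = softmax(scores)
    dim = len(annotations[0])
    return [sum(w * h[j] for w, h in zip(weights, annotations))
            for j in range(dim)]


def multimodal_context(annotations, scores, W_img, pooled):
    """Soft-attention context modulated by a linear image projection."""
    text_ctx = soft_attention(annotations, scores)
    img_vec = matvec(W_img, pooled)  # linear transformation of pooled features
    # Assumption: the "product" is taken element-wise.
    return [t * v for t, v in zip(text_ctx, img_vec)]


# Toy example: two source annotations of size 2, pooled vector of size 3.
context = multimodal_context(
    annotations=[[1.0, 0.0], [0.0, 1.0]],
    scores=[0.0, 0.0],
    W_img=[[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]],
    pooled=[2.0, 3.0, 4.0],
)
print(context)
```

The image thus acts as a global modulation of the textual context rather than as a second attention mechanism, which keeps the number of additional parameters small.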


The bottleneck function projects the cGRU output into probabilities over the target vocabulary. It is defined as follows:


where denotes the concatenation operation.

3 DeepGRU

The deepGRU decoder (Delbrouck and Dupont, 2018) is a variant of the cGRU decoder.

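Gated hyperbolic tangent  First, we make use of the gated hyperbolic tangent activation (Teney et al., 2017) instead of tanh. This non-linear layer implements a function $f : x \in \mathbb{R}^m \rightarrow y \in \mathbb{R}^n$ with parameters defined as follows:

$$\tilde{y} = \tanh(Wx + b), \qquad g = \sigma(W'x + b'), \qquad y = \tilde{y} \odot g$$

where $W, W' \in \mathbb{R}^{n \times m}$ and $b, b' \in \mathbb{R}^n$ are learned parameters. We apply this gating system to equations 11 and 13.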
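The gating can be sketched in a few lines. The snippet follows the formulation of Teney et al. (2017), in which a tanh branch is multiplied element-wise by a sigmoid gate; the function and parameter names are illustrative assumptions.

```python
import math


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def gated_tanh(x, W, b, W_gate, b_gate):
    """y = tanh(W x + b) * sigmoid(W_gate x + b_gate), element-wise."""
    candidate = [math.tanh(a + bi) for a, bi in zip(matvec(W, x), b)]
    gate = [sigmoid(a + bi) for a, bi in zip(matvec(W_gate, x), b_gate)]
    return [c * g for c, g in zip(candidate, gate)]


# Toy example mapping a 2-dimensional input to a 1-dimensional output.
y = gated_tanh(
    x=[0.5, -1.0],
    W=[[1.0, 0.0]], b=[0.0],
    W_gate=[[0.0, 0.0]], b_gate=[0.0],
)
print(y)
```

Compared with a plain tanh layer, the gate doubles the number of weights of the layer it replaces, which is why the size of these weights depends on where the activation is applied.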

GRU bottleneck  When working with small dimensions, one can afford to replace the computation of equation 13 with a new GRU block:


The GRU bottleneck can be seen as a new block encoding the visual information with its surrounding context ( and ). Therefore, equation 12 is not computed with anymore so that the second block encodes the textual information only.

Multimodal projection  Because we now have a linguistic GRU block and a visual GRU block, we want both representations to have their own projection to compute the candidate probabilities. Equations 13 and 14 become:


where comes from equation 16. Note that we use the gated hyperbolic tangent for equation 16 and 17.

4 Data and settings

The Multi30K dataset (Elliott et al., 2016) is provided by the challenge. For each image, one of the English descriptions was selected and manually translated into German and French by a professional translator. As training and development data, 29,000 and 1,014 triples are used respectively. We use the three available test sets to score our models. The Flickr Test2016 and Flickr Test2017 sets each contain 1,000 image-caption pairs, and the ambiguous MSCOCO test set (Elliott et al., 2017) contains 461 pairs. For the WMT18 challenge, a new Flickr Test2018 set of 1,071 sentences was released without the German and French gold translations.

Matrices of the model are initialized using the Xavier method (Glorot and Bengio, 2010) and the gradient norm is clipped to 5. We chose ADAM (Kingma and Ba, 2014) as the optimizer, with a learning rate of 0.0004 and a batch size of 32. To marginally reduce our vocabulary size, we use the byte pair encoding (BPE) algorithm on the train set to convert space-separated tokens into sub-words (Sennrich et al., 2016). With 10K merge operations, the resulting vocabulary sizes of each language pair are: 5204 -> 7067 tokens for English -> German and 5835 -> 6577 tokens for English -> French.
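The initialization and clipping steps can be sketched as follows. The snippet assumes the uniform variant of Xavier initialization and clipping by the global L2 norm of all gradients; the text does not specify either choice, and the helper names are hypothetical.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng):
    """Sample a fan_out x fan_in matrix from U(-limit, limit).

    Assumption: the uniform variant of Glorot and Bengio (2010).
    """
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)]
            for _ in range(fan_out)]


def clip_grad_norm(grads, max_norm=5.0):
    """Rescale all gradients so that their global L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total <= max_norm:
        return grads, total
    scale = max_norm / total
    return [[g * scale for g in vec] for vec in grads], total


# Toy example: a 128 -> 256 projection and a gradient whose norm is 13.
W = xavier_uniform(128, 256, random.Random(0))
clipped, norm_before = clip_grad_norm([[3.0, 4.0], [0.0, 12.0]], max_norm=5.0)
print(norm_before, clipped)
```

Clipping rescales every gradient by the same factor, so the update direction is preserved while its magnitude is bounded.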

We use the following regularization methods: we apply dropout of 0.3 on source embeddings, 0.5 on source annotations and 0.5 on both bottlenecks. We also stop training when the METEOR score does not improve for 10 consecutive evaluations on the validation set (one evaluation is performed every 1000 model updates).
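The stopping criterion can be sketched as a simple patience counter. This is an illustrative sketch only: the function name `best_checkpoint` and the score list are hypothetical, and each entry stands for one validation METEOR score computed after 1000 model updates.

```python
def best_checkpoint(meteor_scores, patience=10):
    """Return (index of best evaluation, index at which training stops)."""
    best, best_idx, bad = float("-inf"), -1, 0
    for i, score in enumerate(meteor_scores):
        if score > best:
            best, best_idx, bad = score, i, 0  # improvement resets the counter
        else:
            bad += 1
            if bad >= patience:
                return best_idx, i  # no improvement for `patience` evaluations
    return best_idx, len(meteor_scores) - 1


# Toy run: METEOR peaks at the third evaluation, then stagnates.
scores = [50.0, 51.0, 52.0] + [51.0] * 12
best_idx, stop_idx = best_checkpoint(scores, patience=10)
print(best_idx, stop_idx)
```

The checkpoint kept for decoding is the one at `best_idx`, not the last one seen before stopping.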

The dimensionality of the various settings and layers is as follows:
Embedding size is 128, encoder and decoder GRU size is 256, embedding layers are: , , .

Attention matrices: .

Bottleneck matrices: and projection matrices: . Weights and are tied.

The size of gated hyperbolic tangent weights depends on their respective application.

5 Results

Our models' performance is evaluated according to the following automated metrics: BLEU-4 (Papineni et al., 2002) and METEOR (Denkowski and Lavie, 2014). We decode with a beam search of size 12 and use model ensembling of size 5 for German and 6 for French. We used the nmtpytorch framework (Caglayan et al., 2017b) for all our experiments. We also release our code.
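One common way to ensemble models at decoding time is to average their next-word distributions at every step. The sketch below assumes this arithmetic averaging; the framework may combine models differently (for instance in log space), and `ensemble_step` is a hypothetical name.

```python
def ensemble_step(distributions):
    """Average the next-word distributions predicted by several models.

    distributions[k][j] is the probability model k gives to vocabulary
    entry j at the current decoding step.
    """
    n = len(distributions)
    vocab = len(distributions[0])
    return [sum(d[j] for d in distributions) / n for j in range(vocab)]


# Toy example: two models, a vocabulary of two words.
avg = ensemble_step([[0.2, 0.8], [0.6, 0.4]])
print(avg)
```

The beam search then ranks candidate words using the averaged distribution instead of the output of a single model.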

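Test set | System | BLEU-4 | METEOR
Test 2016 Flickr | FR-Baseline | 59.08 | 74.73
Test 2016 Flickr | FR-DeepGRU | 62.49 (+3.41) | 76.83 (+2.10)
Test 2016 Flickr | DE-Baseline | 38.43 | 58.37
Test 2016 Flickr | DE-DeepGRU | 40.34 (+1.91) | 59.58 (+1.21)
Test 2017 Flickr | FR-Baseline | 51.86 | 72.75
Test 2017 Flickr | FR-DeepGRU | 55.13 (+3.27) | 71.52 (+1.98)
Test 2017 Flickr | DE-Baseline | 30.80 | 52.33
Test 2017 Flickr | DE-DeepGRU | 32.57 (+1.77) | 53.60 (+1.27)
Test 2017 COCO | FR-Baseline | 43.31 | 64.39
Test 2017 COCO | FR-DeepGRU | 46.16 (+2.85) | 65.79 (+1.40)
Test 2017 COCO | DE-Baseline | 26.30 | 48.45
Test 2017 COCO | DE-DeepGRU | 29.21 (+2.91) | 49.45 (+1.00)
Test 2018 Flickr | FR-DeepGRU | 39.40 | 60.17
Test 2018 Flickr | DE-DeepGRU | 31.10 | 51.64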

6 Conclusion and future work

The full leaderboard scores show close results, and it seems that everybody converges towards the same translation quality score. A few questions arise. Did we reach, to some extent, the full potential of images with respect to the information they can provide? Should we try to add traditional machine translation techniques such as post-editing, since images have been exploited successfully? Another major step forward would be to successfully develop strong and stable models using convolutional features, the latter having 98 times more features than the max-pooled ones.


  • Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.
  • Caglayan et al. (2017a) Ozan Caglayan, Walid Aransa, Adrien Bardet, Mercedes García-Martínez, Fethi Bougares, Loïc Barrault, Marc Masana, Luis Herranz, and Joost van de Weijer. 2017a. Lium-cvc submissions for wmt17 multimodal translation task. In Proceedings of the Second Conference on Machine Translation, pages 432–439. Association for Computational Linguistics.
  • Caglayan et al. (2016) Ozan Caglayan, Walid Aransa, Yaxing Wang, Marc Masana, Mercedes Garcia-Martinez, Fethi Bougares, Loic Barrault, and Joost van de Weijer. 2016. Does multimodality help human and machine for translation and image captioning? arXiv preprint arXiv:1605.09186.
  • Caglayan et al. (2017b) Ozan Caglayan, Mercedes García-Martínez, Adrien Bardet, Walid Aransa, Fethi Bougares, and Loïc Barrault. 2017b. Nmtpy: A flexible toolkit for advanced neural machine translation systems. Prague Bull. Math. Linguistics, 109:15–28.
  • Calixto et al. (2017) Iacer Calixto, Qun Liu, and Nick Campbell. 2017. Doubly-attentive decoder for multi-modal neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1913–1924. Association for Computational Linguistics.
  • Cho et al. (2014) Kyunghyun Cho, Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using rnn encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar. Association for Computational Linguistics.
  • Delbrouck and Dupont (2017a) Jean-Benoit Delbrouck and Stéphane Dupont. 2017a. An empirical study on the effectiveness of images in multimodal neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 910–919, Copenhagen, Denmark. Association for Computational Linguistics.
  • Delbrouck and Dupont (2017b) Jean-Benoit Delbrouck and Stéphane Dupont. 2017b. Modulating and attending the source image during encoding improves multimodal translation. CoRR, abs/1712.03449.
  • Delbrouck and Dupont (2018) Jean-Benoit Delbrouck and Stéphane Dupont. 2018. Bringing back simplicity and lightliness into neural image captioning. CoRR.
  • Denkowski and Lavie (2014) Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.
  • Elliott et al. (2016) D. Elliott, S. Frank, K. Sima’an, and L. Specia. 2016. Multi30k: Multilingual english-german image descriptions. In Proceedings of the 5th Workshop on Vision and Language, pages 70–74.
  • Elliott et al. (2017) Desmond Elliott, Stella Frank, Loïc Barrault, Fethi Bougares, and Lucia Specia. 2017. Findings of the Second Shared Task on Multimodal Machine Translation and Multilingual Image Description. In Proceedings of the Second Conference on Machine Translation, Copenhagen, Denmark.
  • Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS’10). Society for Artificial Intelligence and Statistics.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, pages 311–318, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725. Association for Computational Linguistics.
  • Specia et al. (2016) Lucia Specia, Stella Frank, Khalil Sima’an, and Desmond Elliott. 2016. A shared task on multimodal machine translation and crosslingual image description. In Proceedings of the First Conference on Machine Translation, pages 543–553, Berlin, Germany. Association for Computational Linguistics.
  • Teney et al. (2017) Damien Teney, Peter Anderson, Xiaodong He, and Anton van den Hengel. 2017. Tips and tricks for visual question answering: Learnings from the 2017 challenge. CoRR, abs/1708.02711.