Towards Multimodal Simultaneous Neural Machine Translation

04/07/2020 · Aizhan Imankulova et al.

Simultaneous translation involves translating a sentence before the speaker's utterance is completed in order to realize real-time understanding in multiple languages. This task is significantly harder than general full-sentence translation because of the shortage of input information during decoding. To alleviate this shortage, we propose multimodal simultaneous neural machine translation (MSNMT), which leverages visual information as an additional modality. Although the usefulness of images as an additional modality is moderate for full-sentence translation, we verified, for the first time, its importance for simultaneous translation. Our experiments with the Multi30k dataset showed that MSNMT in a simultaneous setting significantly outperforms its text-only counterpart when the model waits for 5 or fewer input tokens before beginning translation. We then verified the importance of visual information during decoding by (a) performing an adversarial evaluation of MSNMT, in which we studied how the models behave with incongruent input modalities, and (b) analyzing the image attention.


1 Introduction

Simultaneous translation is a natural language processing (NLP) task in which translation begins before the whole source sentence has been received. It is widely used in international summits and conferences, where real-time comprehension is one of the most important aspects. Simultaneous translation is already a difficult task for human interpreters, because the message must be understood and translated while the input sentence is still incomplete Seeber (2015). Consequently, simultaneous translation is even more difficult for machines. Previous works attempt to solve this task by predicting the sentence-final verb Grissom II et al. (2014) or predicting unseen syntactic constituents Oda et al. (2015).

Figure 1: An overview of (a) vanilla NMT, (b) wait-k simultaneous NMT, and (c) multimodal simultaneous machine translation based on the wait-k approach, incorporating visual clues for better En→De translation.

Given the difficulty of predicting future inputs from limited existing inputs, Ma et al. (2019) proposed a simple simultaneous neural machine translation (SNMT) approach called wait-k, which generates the target sentence concurrently with the source sentence but always k tokens behind, for a given k satisfying latency requirements.

However, all existing approaches solve the given task only using the text modality, which may be insufficient to produce a reliable translation. Simultaneous interpreters often consider various additional information sources such as visual clues or acoustic data while translating Seeber (2015). Therefore, we hypothesize that using supplementary information, such as visual clues, can also be beneficial for simultaneous machine translation.

To this end, we propose Multimodal Simultaneous Neural Machine Translation (MSNMT), which supplements the incomplete textual modality with a visual modality, in the form of an image, during the decoding process in order to predict still-missing information and improve translation quality. Our research can be applied in various situations where visual information is related to the content of speech, such as presentations that use slides (e.g., TED Talks, https://interactio.io/) and news video broadcasts (https://www.a.nhk-g.co.jp/bilingual-english/broadcast/nhk/index.html). Our experiments show that the proposed MSNMT method achieves higher translation accuracy than an SNMT model that does not use images, by leveraging the image information. To the best of our knowledge, we are the first to propose incorporating visual information to address the problem of incomplete text information in SNMT.

The main contributions of our research are:

  • We propose to combine multi-modal and simultaneous NMT and discover cases where such multimodal signals are beneficial for the end-task.

  • We show that the MSNMT approach significantly improves the quality of simultaneous translation by enriching incomplete text input information using visual clues.

  • By performing an adversarial evaluation for both text and image and a quantitative attention analysis, we show that the models indeed depend on both textual and visual information.

2 Related Work

For simultaneous translation, it is crucial to predict words that have not appeared yet in order to produce a translation. For example, it is important to anticipate nouns in SVO-SOV translation and verbs in SOV-SVO translation Ma et al. (2019). SNMT can be realized with two types of policy: fixed and adaptive Zheng et al. (2019). Most studies with adaptive policies predict upcoming tokens explicitly, for example the sentence-final verb Grissom II et al. (2014); Matsubara et al. (2000) or unseen syntactic constituents Oda et al. (2015). Dynamic SNMT models Gu et al. (2017); Dalvi et al. (2018); Arivazhagan et al. (2019), which decide whether to READ or WRITE within a single model, have the advantage of exploiting the limited input text as effectively as possible. Meanwhile, Ma et al. (2019) proposed the simple wait-k method with a fixed policy, which generates the target sentence only from the source prefix that is delayed by k tokens. However, all of these simultaneous translation models rely solely on the source sentence. In this research, we concentrate on the wait-k approach with a fixed policy so that the amount of input textual context can be controlled, allowing us to better analyze whether multimodality is effective in SNMT.

Multimodal NMT (MNMT) for full-sentence machine translation has been developed to enrich the text modality with visual information Hitschler et al. (2016); Specia et al. (2016). While the improvement brought by visual features is moderate, their usefulness was demonstrated by Caglayan et al. (2019). They showed that MNMT models are able to capture visual clues under limited textual context, where source sentences are synthetically degraded by color deprivation, entity masking, and progressive masking. However, theirs is an artificial setting in which the models are deliberately deprived of source-side textual context by masking. In contrast, our research identifies an actual end task and shows the effectiveness of multimodal data there. Moreover, in their progressive masking experiments the model is exposed to only the first k words; in our case, the model eventually sees the whole source sentence, generating one target token for every newly read source token after an initial wait of k tokens.

In MNMT, visual features are incorporated into standard machine translation in many ways. Doubly-attentive models capture the textual and visual context vectors independently and then combine them by concatenation Calixto et al. (2017) or hierarchically Libovický and Helcl (2017). Some studies use visual features in a multitask learning scenario Elliott and Kádár (2017); Zhou et al. (2018). Recent work on MNMT has also partly addressed lexical ambiguity by using visual information Elliott et al. (2017); Lala and Specia (2018); Gella et al. (2019), showing that combining textual context with visual features outperforms unimodal models.

In our study, visual features are extracted using image processing techniques and then integrated into an SNMT model as additional information that should help predict missing words in a simultaneous translation scenario. To the best of our knowledge, this is the first work that incorporates external knowledge into an SNMT model.

3 Multimodal Simultaneous Neural Machine Translation Architecture

Our main goal in this paper is to investigate whether image information improves SNMT, so that the two lines of research can benefit from each other. To this end, we keep our experiments as simple as possible, without additional data or other model types. This allows us to control the amount of input textual context and to analyze the relationship between the amounts of textual and visual information.

In this section, we describe our MSNMT model, which combines the SNMT framework of Ma et al. (2019) with the multimodal model of Libovický and Helcl (2017) (Figure 1 (c)). We base our model on the RNN architecture Libovický and Helcl (2017); Caglayan et al. (2017a). The model takes a sentence and its corresponding image as inputs. The decoder outputs the target-language sentence using a simultaneous translation mechanism, attending not only to the source sentence but also to its associated image. Our code is publicly available at: https://anonymous.

3.1 Simultaneous Translation

We first briefly review standard NMT to set up the notation (see also Figure 1, (a)). The encoder of a standard NMT model takes the whole input sequence $\mathbf{x} = (x_1, \dots, x_n)$ of length $n$, where each $x_i$ is a word embedding, and produces source hidden states $\mathbf{h} = (h_1, \dots, h_n)$. The decoder predicts the next output token $y_t$ using $\mathbf{h}$ and the previously generated tokens, denoted $\mathbf{y}_{<t} = (y_1, \dots, y_{t-1})$. The final output probability is calculated using the following equation:

$p(\mathbf{y} \mid \mathbf{x}) = \prod_{t=1}^{|\mathbf{y}|} p(y_t \mid \mathbf{x}, \mathbf{y}_{<t})$   (1)

Different from standard neural translation, in which each $y_t$ is predicted using the entire source sentence $\mathbf{x}$, simultaneous translation must translate concurrently with the growing source sentence. We incorporate the wait-k approach Ma et al. (2019) into our simultaneous translation model (Figure 1, (b)). Instead of waiting for the whole sentence before translating, the model waits for only the first k tokens and then generates one target token after reading each new source token, stopping the intake of new tokens once the whole input sentence is available. For example, if k = 3, the first target token is predicted using the first 3 source tokens, and the second target token using the first 4 source tokens. The wait-k decoding probability is:

$p(\mathbf{y} \mid \mathbf{x}) = \prod_{t=1}^{|\mathbf{y}|} p(y_t \mid \mathbf{x}_{\leq g(t)}, \mathbf{y}_{<t})$   (2)

where $g(t)$ is the wait-k policy function, which decides how much input text has been read when translating the $t$-th target token, and $\mathbf{x}_{\leq g(t)}$ is the source prefix $(x_1, \dots, x_{g(t)})$. $g(t)$ is defined as follows:

$g(t) = \min\{k + t - 1,\; n\}$   (3)

When $k + t - 1$ exceeds the source length $n$, $g(t)$ is fixed to $n$, which means the remaining target tokens (including the current step) are generated using the full source sentence. For full-sentence translation, $g(t)$ is the constant $n$.
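
To make the wait-k policy concrete, below is a minimal Python sketch of Equations 2-3. The function model.decode_step(prefix, hypothesis) is a hypothetical stand-in for one greedy decoder step of an encoder-decoder NMT model; it is not part of any specific toolkit.

```python
# Minimal sketch of wait-k prefix-to-prefix decoding (Eqs. 2-3).
# `model.decode_step` is a hypothetical helper, not an existing library call.

def wait_k_policy(t, k, src_len):
    """g(t) = min(k + t - 1, n): number of source tokens visible at target step t (1-indexed)."""
    return min(k + t - 1, src_len)

def wait_k_decode(model, src_tokens, k, max_len=100, eos="</s>"):
    """Generate target tokens, each conditioned only on the source prefix x_{<= g(t)}."""
    hypothesis = []
    for t in range(1, max_len + 1):
        g_t = wait_k_policy(t, k, len(src_tokens))
        prefix = src_tokens[:g_t]                    # only the first g(t) source tokens
        y_t = model.decode_step(prefix, hypothesis)  # greedy choice of the next target token
        hypothesis.append(y_t)
        if y_t == eos:
            break
    return hypothesis
```

For full-sentence translation, the same loop applies with g(t) fixed to the source length, so every target token conditions on the whole source sentence.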

3.2 Multimodal Translation

We use the hierarchical attention combination technique of Libovický and Helcl (2017) to incorporate visual and textual features into an MNMT model. This model calculates independent context vectors from the textual features $\mathbf{h}^{(\text{txt})}$ and the visual features $\mathbf{h}^{(\text{img})}$, which are extracted by the textual encoder and the image processing model, respectively. It then combines the two resulting vectors using a second attention mechanism, which allows simultaneous translation to take visual information into account.

Specifically, we compute the context vectors for the image ($m = \text{img}$) and text ($m = \text{txt}$) modalities independently using the following equations:

$e^{(m)}_{tj} = f_m\big(s_t, h^{(m)}_j\big)$   (4)
$\alpha^{(m)}_{tj} = \mathrm{softmax}_j\big(e^{(m)}_{tj}\big)$   (5)
$c^{(m)}_t = \sum_j \alpha^{(m)}_{tj} h^{(m)}_j$   (6)

where $f_m$ is a feedforward network for each modality $m$, $h^{(m)}_j$ is the $j$-th encoder state of modality $m$, and $s_t$ is the $t$-th decoder hidden state.

We project these image and text context vectors into a common space and compute another distribution over the projected context vectors, together with their weighted average, using the second attention:

$e^{(m)}_t = f_b\big(s_t, c^{(m)}_t\big)$   (7)
$\beta^{(m)}_t = \mathrm{softmax}_m\big(e^{(m)}_t\big)$   (8)
$c_t = \sum_{m \in \{\text{img},\, \text{txt}\}} \beta^{(m)}_t\, U^{(m)}_c\, c^{(m)}_t$   (9)

where $f_b$ is a feedforward network. Equation 8 computes the second attention, which combines the image and text vectors. $U^{(m)}_c$ is a weight matrix used to compute the context vector $c_t$ from the image and text features.

The final hypothesis has the probability:

$p(\mathbf{y} \mid \mathbf{x}, \mathbf{z}) = \prod_{t=1}^{|\mathbf{y}|} p(y_t \mid \mathbf{x}, \mathbf{z}, \mathbf{y}_{<t})$   (10)

where $\mathbf{z}$ represents the input image features.
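
The second (hierarchical) attention of Equations 7-9 can be sketched in PyTorch as below. The layer names, dimensions, and exact parameterization of $f_b$ and $U^{(m)}_c$ are illustrative assumptions, not the authors' implementation (which is built on nmtpytorch).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalAttention(nn.Module):
    """Sketch of the hierarchical (second) attention of Libovický and Helcl (2017), Eqs. 7-9."""

    def __init__(self, txt_dim, img_dim, dec_dim, att_dim):
        super().__init__()
        # modality-specific projections into a common attention space (part of f_b, Eq. 7)
        self.proj_txt = nn.Linear(txt_dim, att_dim)
        self.proj_img = nn.Linear(img_dim, att_dim)
        self.proj_dec = nn.Linear(dec_dim, att_dim)
        self.energy = nn.Linear(att_dim, 1)
        # U_c^{(m)}: project each modality's context vector into a common output space (Eq. 9)
        self.out_txt = nn.Linear(txt_dim, dec_dim)
        self.out_img = nn.Linear(img_dim, dec_dim)

    def forward(self, s_t, c_txt, c_img):
        # Eq. 7: energies from a small feedforward network over (decoder state, context vector)
        e_txt = self.energy(torch.tanh(self.proj_dec(s_t) + self.proj_txt(c_txt)))
        e_img = self.energy(torch.tanh(self.proj_dec(s_t) + self.proj_img(c_img)))
        # Eq. 8: second attention distribution over the two modalities
        beta = F.softmax(torch.cat([e_txt, e_img], dim=-1), dim=-1)   # shape (batch, 2)
        # Eq. 9: weighted average of the projected context vectors
        c_t = beta[:, :1] * self.out_txt(c_txt) + beta[:, 1:] * self.out_img(c_img)
        return c_t, beta
```

The returned beta is the per-modality weight that we later average in the attention analysis of Section 6.2.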

3.3 Multimodal Simultaneous Neural Machine Translation

In this subsection, we describe the structure of the MSNMT model, which is a combination of the models described in Sections 3.1 and 3.2. The image context vector is calculated exactly as in MNMT; however, the text context vector (Equation 6) for the $t$-th step is computed only over the source prefix:

$c^{(\text{txt})}_t = \sum_{j=1}^{g(t)} \alpha^{(\text{txt})}_{tj} h^{(\text{txt})}_j$   (11)

Thus $c^{(\text{txt})}_t$ is calculated from the input text prefix determined by the wait-k policy function $g(t)$. We then apply the second attention to $c^{(\text{txt})}_t$ and $c^{(\text{img})}_t$ in order to calculate $c_t$ (Equation 9).

The decoding probability becomes:

$p(\mathbf{y} \mid \mathbf{x}, \mathbf{z}) = \prod_{t=1}^{|\mathbf{y}|} p(y_t \mid \mathbf{x}_{\leq g(t)}, \mathbf{z}, \mathbf{y}_{<t})$   (12)
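
The following sketch shows how one MSNMT decoder step could combine the wait-k prefix restriction (Equation 11) with the hierarchical attention above. Here text_attention, image_attention, and hier_attention are assumed helper modules implementing Equations 4-6 and 7-9; the names are illustrative, not the authors' code.

```python
# Sketch of one MSNMT decoder step (Eqs. 11-12). All helper names are assumptions.

def msnmt_decoder_step(t, k, s_t, enc_txt, enc_img,
                       text_attention, image_attention, hier_attention):
    n = enc_txt.size(1)                             # number of source tokens encoded so far
    g_t = min(k + t - 1, n)                         # wait-k policy g(t)
    c_txt = text_attention(s_t, enc_txt[:, :g_t])   # Eq. 11: attend only over the source prefix
    c_img = image_attention(s_t, enc_img)           # image features are fully available throughout
    c_t, beta = hier_attention(s_t, c_txt, c_img)   # Eqs. 7-9: combine the two context vectors
    return c_t, beta                                # fed to the decoder to predict y_t (Eq. 12)
```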

4 Experimental Setup

4.1 Dataset

We used the train, development, and test sets from the Multi30k Elliott et al. (2016) dataset published in the WMT16 Shared Task, a benchmark dataset generally used in multimodal machine translation research Libovický and Helcl (2017); Caglayan et al. (2019). (Involving other types of training data is out of the scope of this paper, but it is a natural next step for this research.) In addition to the test set provided by WMT16 (test2016), we also experimented on the test set from the WMT17 Shared Task (test2017).

We experiment with our model in six translation directions over four languages: English (En), German (De), French (Fr), and Czech (Cs). All language pairs include En on one side. The data split for all pairs was as follows: training set, 29,000 sentence pairs; development set, 1,014 sentence pairs; and 1,000 and 1,071 sentence pairs for test2016 and test2017, respectively. The average sentence length in this dataset is 12-13 tokens.

We limit the vocabulary size of the source and target languages to 10,000 words. All sentences are preprocessed with lowercasing, tokenization, and punctuation normalization using the Moses scripts (we applied preprocessing using task1-tokenize.sh from https://github.com/multi30k/dataset). Note that we experiment at the word level, without subword units such as BPE.

Visual features are extracted using a pre-trained ResNet He et al. (2016). Specifically, we encode all images in Multi30k with ResNet-50 and take the hidden state of the relu4f layer as 14 × 14 × 1,024-dimensional visual features.
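
As a rough illustration of this feature-extraction step, the snippet below uses torchvision's pre-trained ResNet-50 and takes the output of layer3, which we assume corresponds to the relu4f activation named above, giving a 14 × 14 grid of 1,024-dimensional vectors per image. The preprocessing values are standard ImageNet defaults, not details reported in the paper.

```python
import torch
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image

# Sketch of visual feature extraction with a pre-trained ResNet-50.
resnet = models.resnet50(pretrained=True).eval()
# keep everything up to and including layer3 (drops layer4, avgpool, fc)
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-3])

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_features(image_path):
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        feats = feature_extractor(image)      # (1, 1024, 14, 14)
    # flatten the spatial grid into 196 regions of 1,024-d each for the image attention
    return feats.flatten(2).transpose(1, 2)   # (1, 196, 1024)
```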

wait-k   En→De     De→En     En→Fr     Fr→En     En→Cs     Cs→En
         S    M    S    M    S    M    S    M    S    M    S    M
1 12.76 20.41 10.64 13.86 14.08 19.84 10.54 13.46 7.81 9.97 13.25 15.39
3 25.94 26.66 16.03 17.45 26.74 28.54 16.29 18.02 12.56 13.40 17.84 18.36
5 30.85 31.30 19.99 20.55 34.82 35.92 20.32 21.29 15.89 16.31 22.78 23.22
7 36.69 36.80 25.01 25.35 44.49 44.75 25.10 25.73 19.09 19.29 27.69 28.02
9 42.90 42.34 28.98 29.04 53.63 53.62 30.18 30.29 22.36 22.34 31.47 30.96
Full 54.80 53.92 36.59 35.59 73.28 72.00 43.09 42.27 29.25 28.79 36.14 35.39
Table 1: METEOR scores of SNMT (S) and MSNMT (M) models for six translation directions on test2016. Results are the average of three runs. Bold indicates the best METEOR score for each wait-k for each translation direction. Statistical significance of improvements over SNMT was assessed as described in Section 5.
wait-k   En→De     De→En     En→Fr     Fr→En
         S    M    S    M    S    M    S    M
1 7.32 16.19 9.83 11.92 12.65 17.65 8.75 12.05
3 20.79 21.47 13.91 15.43 22.53 24.59 14.03 15.59
5 25.00 25.74 17.44 19.13 30.36 30.96 17.72 19.06
7 31.64 31.85 22.65 23.48 41.85 42.15 23.05 23.77
9 37.78 37.11 26.84 27.03 51.48 51.84 28.94 29.28
Full 47.91 47.29 33.44 32.32 67.83 65.89 40.41 39.92
Table 2: METEOR scores of SNMT (S) and MSNMT (M) models for four language pairs on test2017. Results are the average of three runs. Bold indicates the best METEOR score for each wait-k for each translation direction. Statistical significance of improvements over SNMT was assessed as described in Section 5.
En De Fr Cs
12.36 18.65 17.71 8.76
Table 3: METEOR scores of Captioning models into four target languages on test2016. Results are the average of three runs.

4.2 Systems

We compare the following models:

1. Captioning: We experimented with image captioning in order to examine whether visual clues alone can produce adequate translations. In this setting, instead of an input sentence, we used a single <cpt> token for each image of Multi30k to produce its description with the MSNMT architecture.

2. SNMT: We use only the text modality of the training data as a baseline for each wait-k model.

3. MSNMT: We use the image modality along with the text modality of the training data for each wait-k model.

To train the above models, we use attentional NMT Bahdanau et al. (2015) with a 2-layer unidirectional GRU encoder and a 2-layer conditional GRU decoder. We use the open-source implementation of the nmtpytorch toolkit v3.0.0 Caglayan et al. (2017b). Due to space constraints, we list the hyperparameters in Appendix A; hyperparameters not mentioned there were set to the default values in nmtpytorch. We used early stopping: when the METEOR score Denkowski and Lavie (2011) did not increase on the development set for 10 epochs, training was stopped.

5 Results

In this section, we report METEOR scores, a widely used evaluation metric in MNMT, on our test sets for each wait-k model. (Due to space constraints, we show results only for the test sets; BLEU scores are reported in Appendix B.) Statistical significance of the differences in BLEU scores was tested with Moses's bootstrap-hypothesis-difference-significance.pl. "Full" means that the whole input sentence is used as input to the model. All reported results are the average of three runs with three different random seeds.

Tables 1-2 show the METEOR scores of the MSNMT and SNMT models on test2016 and test2017, respectively. For all language pairs, MSNMT systems show significant improvements over SNMT systems when input textual information is scarce (k ≤ 5). Note that the gap in METEOR scores between MSNMT and SNMT grows as the available input prefix shrinks. On the other hand, when more tokens are available during decoding (k > 5), the text information becomes sufficient in most cases.

wait-k   En→De     De→En     En→Fr     Fr→En     En→Cs     Cs→En
         C    I    C    I    C    I    C    I    C    I    C    I
1 20.41 14.83 13.86 9.60 19.84 13.02 13.46 9.34 9.97 6.32 15.39 12.35
3 26.66 23.50 17.45 14.86 28.54 24.47 18.02 15.08 13.40 11.49 18.36 16.72
5 31.30 29.01 20.55 18.84 35.92 32.30 21.29 18.84 16.31 14.93 23.22 21.62
7 36.80 35.17 25.35 24.05 44.75 42.60 25.73 24.01 19.29 18.20 28.02 27.27
9 42.34 41.39 29.04 28.30 53.62 52.03 30.29 29.29 22.34 21.66 30.96 30.24
Full 53.92 53.43 35.59 35.37 72.00 71.74 42.27 42.29 28.79 28.80 35.39 34.91
Table 4: Image Awareness results on test2016. METEOR scores of MSNMT Congruent (C) and Incongruent (I) settings for six translation directions. Results are the average of three runs. Bold indicates the best METEOR score for each wait-k for each translation direction.
wait-k   En→De     De→En     En→Fr     Fr→En     En→Cs     Cs→En
         S    M    S    M    S    M    S    M    S    M    S    M
1 11.33 16.29 8.65 11.21 10.99 15.82 8.52 10.94 6.27 7.58 8.19 10.41
3 12.46 13.04 7.48 9.07 10.28 12.62 7.42 9.22 5.31 6.34 7.23 8.67
5 11.01 12.27 6.93 8.41 9.49 11.20 6.95 8.20 4.52 4.98 6.80 7.70
7 10.47 11.59 6.69 7.64 8.98 10.40 6.64 7.52 4.32 4.72 6.58 7.23
9 10.09 10.48 6.47 7.04 8.73 9.68 6.52 6.92 4.09 4.60 6.47 7.07
Full 9.86 9.89 6.30 6.49 8.69 8.96 6.20 6.25 4.04 4.12 6.41 6.58
Table 5: Text Awareness results on test2016. METEOR scores of SNMT (S) and MSNMT (M) models for six translation directions. Results are the average of three runs. Bold indicates the best METEOR score for each wait-k for each translation direction.

The results of Captioning in Table 3, compared with those in Table 1, show that visual information alone is not sufficient for translation: captioning ignores the actual source text and only describes the image.

6 Analysis

In this section, we provide a thorough analysis to further investigate the effect of visual data to produce a simultaneous translation by: (a) providing adversarial evaluation; and (b) visualizing attention.

Source a black dog and a brown dog with a ball .
Target ein schwarzer und ein brauner hund mit einem ball .
Captioning zwei hunde spielen im gras . (Two dogs are playing in the grass .)
S ein schwarzer hund springt über einen zaun . (a black dog jumps over a fence .)
M ein schwarzer hund und ein brauner hund rennen auf einem Feld . (a black dog and a brown dog run on a field .)
S full ein schwarzer hund und ein brauner hund mit einem ball . (a black dog and a brown dog with a ball .)
M full ein schwarzer hund und ein brauner hund mit einem ball . (a black dog and a brown dog with a ball .)
Source a baseball player in a black shirt just tagged a player in a white shirt .
Target eine baseballspielerin in einem schwarzen shirt fängt eine spielerin in einem weißen shirt .
Captioning ein mann in einem weißen trikot macht einen trick auf dem boden und hält dabei einen anderen mann .
(a man in a white jersey is doing a trick on the floor while holding another man .)
S ein baseballspieler in einem roten trikot versucht den ball zu fangen , während der schiedsrichter zuschaut .
(a baseball player in a red jersey tries to catch the ball while the referee is watching.)
M ein baseballspieler versucht , einen ball zu fangen .
(a baseball player is trying to catch a ball.)
S full ein baseballspieler in einem schwarzen hemd hat einen spieler in einem weißen hemd unk .
(a baseball player in a black shirt has a player in a white shirt unk .)
M full ein baseballspieler in einem schwarzen hemd hat gerade ein spieler in einem weißen hemd unk .
(a baseball player in a black shirt has just one player in a white shirt unk .)
Table 6: Examples of En→De translations from test2016 using SNMT (S) and MSNMT (M) models. English glosses are shown in parentheses. Italics indicate correct translation outputs.

6.1 Adversarial Evaluation

In order to determine whether MSNMT systems are aware of the visual context Elliott (2018), we perform two different versions of adversarial evaluation on test2016:

Image Awareness.

We present our system with the correct image for each source sentence (Congruent) as opposed to a random image (Incongruent) Elliott (2018). For that purpose, we reversed the order of the 1,000 images of test2016 so that no sentence keeps its congruent image, and then extracted image features for these shuffled images as input to the model.

Text Awareness.

We present our system with incorrect source sentences but the correct visual information in order to determine the impact of visual data on producing correct translations for noisy text input. We use the same shuffling technique as above, applied to the text data.
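
A minimal sketch of how the congruent and incongruent pairings can be constructed is shown below; sentences and image_features are assumed to be parallel lists indexed by test2016 example id, and the names are illustrative.

```python
# Sketch of the adversarial pairings for the two awareness tests (assumed data layout).

def incongruent_images(sentences, image_features):
    """Image awareness: pair each sentence with an image from the reversed order,
    so no sentence keeps its own (congruent) image."""
    return list(zip(sentences, reversed(image_features)))

def incongruent_sentences(sentences, image_features):
    """Text awareness: keep the correct image but feed a mismatched source sentence."""
    return list(zip(reversed(sentences), image_features))
```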

The results of the image awareness experiments are shown in Table 4. There is a large gap in METEOR scores between the congruent and incongruent settings when the input text is incomplete, which implies that our proposed model learns to extract information from images for translation. Interestingly, for full-sentence translation the incongruent scores are very close to, and occasionally exceed, the congruent ones: when textual information is sufficient, visual information becomes less relevant in some cases.

Figure 2: Images used in the translation examples (Table 6) and attention visualizations (Figures 3-4): (a) dogs, (b) players.
Figure 3: Attention visualization of MSNMT outputs for the image in Figure 2(a) at each decoding step of En→De translation, for (a) k=3 and (b) full-sentence input (see Table 6).
Figure 4: Attention visualization of MSNMT outputs for the image in Figure 2(b) at each decoding step of En→De translation, for (a) k=3 and (b) full-sentence input (see Table 6).

From the results of the text awareness experiments (Table 5) we can draw the following conclusions. The fact that MSNMT models handle noisy text input better than SNMT models implies that the proposed model can leverage visual information. For both SNMT and MSNMT, the METEOR score degrades as the number of available first k tokens increases: the more noise is given as input, the more the model gets confused, but visual information makes the model more robust to the introduced noise. MSNMT models also take textual information into account, since their performance drops as more shuffled input tokens are provided, in contrast to Table 1 (columns M), where more correct context improves performance.

Figure 5: Hierarchical (second) attention scores for visual features on test2016 for six translation directions with different wait-k models. Scores are averaged over all sentences in the test2016 set.

6.2 Visual Attention

As an example, we sampled sentences and their images from test2016 (Figure 2) to compare the outputs of our systems. Table 6 lists the translations generated by the Captioning, SNMT (S), and MSNMT (M) models. In the first example, Captioning did not capture "a ball" or "a black dog and a brown dog" from the source sentence. The SNMT model with k=3 predicted an erroneous "zaun (fence)," which is present neither in the source text nor in the corresponding image. The MSNMT model, on the other hand, captured both the input text and the visual information and generated a richer output. When the full sentence is given as input, both MSNMT and SNMT translated it correctly. In the second example, none of the models generated a correct translation. Captioning and SNMT generated words that are not present in either input, such as "schiedsrichter (referee)" or "trick (trick)." Our MSNMT models also failed to capture the gender of the gender-neutral source word "player" and translated it into "spieler" instead of "spielerin," although the gender was obvious from the visual information.

For a more detailed analysis, we first visualized the attention over the images of the above examples at each decoding step for the k=3 and Full input scenarios (see Figures 3-4). Given incomplete text information, the proposed MSNMT model attends to different parts of the image. For example, when decoding the token "brauner," MSNMT attends more to the brown dog, and when decoding "rennen," the model attends to the legs of the dogs (see Figure 2(a)). In the other example, MSNMT focuses on a player while decoding "baseballspieler." We hypothesize that the MSNMT model is trying to find useful information in the image. In contrast, when the input text is fully given, MSNMT attends only to localized parts of the image. These results show, once again, that visual data can enrich an incomplete input sentence and help produce more accurate translations with low latency in most cases.

Furthermore, we investigate how much attention each wait-k model gives to the visual information. For that purpose, we calculate the average weight of the second attention (Equation 8) on the visual features over all decoding steps and all sentences. Figure 5 reports these averaged second-attention scores for visual features on test2016 for six translation directions. We can see that for lower k values the MSNMT model utilizes image information more.
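
The averaging behind Figure 5 can be sketched as follows, assuming betas is a list with one (target_length, 2) tensor of second-attention weights per sentence (as returned by the HierarchicalAttention sketch above), with column 1 holding the weight placed on the image context vector at each decoding step.

```python
import torch

# Sketch of the Figure 5 analysis: average second-attention weight on the visual modality.
def average_visual_attention(betas):
    per_sentence = [b[:, 1].mean() for b in betas]      # mean over decoding steps
    return torch.stack(per_sentence).mean().item()      # mean over the test set
```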

7 Conclusion

In this paper, we proposed a multimodal simultaneous neural machine translation approach that takes advantage of visual information as an additional modality to compensate for the shortage of input text information in simultaneous neural machine translation. We showed that, in a wait-k setting, our model significantly outperforms its text-only counterpart when only a few input tokens are available before beginning translation. Furthermore, we demonstrated the importance of visual information for simultaneous translation, especially for small k, through a thorough analysis on the Multi30k data. We hope that our proposed method will be explored further for various tasks and datasets.

In this paper, we trained a separate model for each value of wait-k. In future work, we plan to experiment with a single model for all k values Zheng et al. (2019). Furthermore, we acknowledge the importance of investigating the effects of MSNMT on more realistic data (e.g., TED), where the utterance does not necessarily match the image shown while speaking and/or where the context cannot be guessed from that image.

Acknowledgments

We are immensely grateful to Raj Dabre and Rob van der Goot, who provided expertise, support, and insightful comments that greatly improved the manuscript. We would also like to show our gratitude to Desmond Elliott for valuable feedback and discussions of the paper.

References

  • Arivazhagan et al. (2019) Naveen Arivazhagan, Colin Cherry, Wolfgang Macherey, Chung-Cheng Chiu, Semih Yavuz, Ruoming Pang, Wei Li, and Colin Raffel. 2019. Monotonic infinite lookback attention for simultaneous machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1313–1323.
  • Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations.
  • Caglayan et al. (2017a) Ozan Caglayan, Walid Aransa, Adrien Bardet, Mercedes García-Martínez, Fethi Bougares, Loïc Barrault, Marc Masana, Luis Herranz, and Joost van de Weijer. 2017a. LIUM-CVC submissions for WMT17 multimodal translation task. In Proceedings of the Second Conference on Machine Translation, pages 432–439.
  • Caglayan et al. (2017b) Ozan Caglayan, Mercedes García-Martínez, Adrien Bardet, Walid Aransa, Fethi Bougares, and Loïc Barrault. 2017b. NMTPY: A flexible toolkit for advanced neural machine translation systems. The Prague Bulletin of Mathematical Linguistics, 109(1):15–28.
  • Caglayan et al. (2019) Ozan Caglayan, Pranava Madhyastha, Lucia Specia, and Loïc Barrault. 2019. Probing the need for visual context in multimodal machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4159–4170.
  • Calixto et al. (2017) Iacer Calixto, Qun Liu, and Nick Campbell. 2017. Doubly-attentive decoder for multi-modal neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1913–1924.
  • Dalvi et al. (2018) Fahim Dalvi, Nadir Durrani, Hassan Sajjad, and Stephan Vogel. 2018. Incremental decoding and training methods for simultaneous translation in neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 493–499.
  • Denkowski and Lavie (2011) Michael Denkowski and Alon Lavie. 2011. Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 85–91.
  • Elliott (2018) Desmond Elliott. 2018. Adversarial evaluation of multimodal machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2974–2978.
  • Elliott et al. (2017) Desmond Elliott, Stella Frank, Loïc Barrault, Fethi Bougares, and Lucia Specia. 2017. Findings of the second shared task on multimodal machine translation and multilingual image description. In Proceedings of the Second Conference on Machine Translation, pages 215–233.
  • Elliott et al. (2016) Desmond Elliott, Stella Frank, Khalil Sima’an, and Lucia Specia. 2016. Multi30k: Multilingual English-German image descriptions. In Proceedings of the 5th Workshop on Vision and Language, pages 70–74.
  • Elliott and Kádár (2017) Desmond Elliott and Àkos Kádár. 2017. Imagination improves multimodal translation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 130–141.
  • Gella et al. (2019) Spandana Gella, Desmond Elliott, and Frank Keller. 2019. Cross-lingual visual verb sense disambiguation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1998–2004.
  • Grissom II et al. (2014) Alvin Grissom II, He He, Jordan Boyd-Graber, John Morgan, and Hal Daumé III. 2014. Don't until the final verb wait: Reinforcement learning for simultaneous machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1342–1352.
  • Gu et al. (2017) Jiatao Gu, Graham Neubig, Kyunghyun Cho, and Victor OK Li. 2017. Learning to translate in real-time with neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1, Long Papers), pages 1053–1062.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.
  • Hitschler et al. (2016) Julian Hitschler, Shigehiko Schamoni, and Stefan Riezler. 2016. Multimodal pivots for image caption translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2399–2409.
  • Lala and Specia (2018) Chiraag Lala and Lucia Specia. 2018. Multimodal lexical translation. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), pages 3810–3817.
  • Libovický and Helcl (2017) Jindřich Libovický and Jindřich Helcl. 2017. Attention strategies for multi-source sequence-to-sequence learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 196–202.
  • Ma et al. (2019) Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng, Kaibo Liu, Baigong Zheng, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, Hua Wu, and Haifeng Wang. 2019. STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3025–3036.
  • Matsubara et al. (2000) Shigeki Matsubara, Kiyoshi Iwashima, Nobuo Kawaguchi, Katsuhiko Toyama, and Yoichi Inagaki. 2000. Simultaneous Japanese-English interpretation based on early prediction of English verb. In Proceedings of The Fourth Symposium on Natural Language Processing, pages 268–273.
  • Oda et al. (2015) Yusuke Oda, Graham Neubig, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2015. Syntax-based simultaneous translation through prediction of unseen syntactic constituents. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 198–207.
  • Seeber (2015) Kilian G Seeber. 2015. Simultaneous interpreting. In The Routledge Handbook of Interpreting, pages 91–107.
  • Specia et al. (2016) Lucia Specia, Stella Frank, Khalil Sima’an, and Desmond Elliott. 2016. A shared task on multimodal machine translation and crosslingual image description. In Proceedings of the First Conference on Machine Translation: (Volume 2: Shared Task Papers), pages 543–553.
  • Zheng et al. (2019) Baigong Zheng, Renjie Zheng, Mingbo Ma, and Liang Huang. 2019. Simultaneous translation with flexible policy via restricted imitation learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5816–5822.
  • Zhou et al. (2018) Mingyang Zhou, Runxiang Cheng, Yong Jae Lee, and Zhou Yu. 2018. A visual attention grounding neural model for multimodal machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3643–3653.

Appendix A Hyperparameters

Table 7 lists the hyperparameters of the SNMT and MSNMT models used in our experiments. For a fair comparison, SNMT and MSNMT share the same hyperparameters, except for the MSNMT-specific ones.

Parameter SNMT & MSNMT
Enc., Dec. dim. 320
Emb. dim. 200
Dropout 0.5
Dropout for emb. 0.4
Tied embedding 2-way
Max length 100
Optimizer adam
Learning rate 0.0004
Batch size 64
wait-k 1, 3, 5, 7, 9, Full
Parameter MSNMT
Sampler type approximate
Dec. init zero
Fusion type hierarchical
Channels 1024
Table 7: Hyperparameter values of SNMT and MSNMT models.

Appendix B BLEU scores

Tables 8-10 show BLEU scores of models used in our experiments (corresponding METEOR scores are shown in Tables 1-3).

wait-k   En→De     De→En     En→Fr     Fr→En     En→Cs     Cs→En
         S    M    S    M    S    M    S    M    S    M    S    M
1 1.13 4.51 4.20 7.43 2.09 6.16 4.17 7.12 1.10 3.10 6.92 9.50
3 8.74 11.12 11.33 12.91 12.25 13.62 12.37 14.24 5.09 6.52 15.03 15.13
5 15.86 16.52 17.96 18.05 21.80 22.47 20.29 21.28 11.35 11.73 21.63 21.93
7 21.95 22.25 24.35 24.54 31.21 31.34 27.51 28.02 16.66 16.93 27.36 27.74
9 27.16 26.75 29.32 29.37 39.88 39.83 34.64 34.58 21.11 21.02 31.79 30.12
Full 35.55 34.14 38.56 37.47 58.27 56.33 51.85 50.07 29.69 28.80 36.29 36.29
Table 8: BLEU scores of SNMT (S) and MSNMT (M) models for six translation directions on test2016. Results are the average of three runs. Bold indicates the best BLEU score for each wait-k for each translation direction. Statistical significance of improvements over SNMT was tested as described in Section 5.
wait-k   En→De     De→En     En→Fr     Fr→En
         S    M    S    M    S    M    S    M
1 0.09 2.30 2.39 4.77 1.62 4.52 2.33 5.43
3 5.61 6.83 7.40 9.09 8.72 10.37 9.02 10.49
5 9.50 9.33 13.08 14.22 16.93 17.40 15.16 16.59
7 14.94 14.57 20.10 20.99 27.23 27.20 23.55 23.73
9 20.80 20.15 25.50 25.08 35.99 36.08 32.01 31.59
Full 26.62 26.34 32.72 31.10 50.36 48.70 46.01 45.31
Table 9: BLEU scores of SNMT (S) and MSNMT (M) models for four language pairs on test2017. Results are the average of three runs. Bold indicates the best BLEU score for each wait-k for each translation direction. Statistical significance of improvements over SNMT was tested as described in Section 5.
En De Fr Cs
5.68 3.80 4.97 2.77
Table 10: BLEU scores of Captioning models into four target languages on test2016. Results are the average of three runs.