Improving Context-aware Neural Machine Translation with Target-side Context

09/02/2019 · Hayahide Yamagishi, et al.

In recent years, several studies on neural machine translation (NMT) have attempted to incorporate document-level context by using a multi-encoder with two attention mechanisms that read the current and previous sentences. These studies concluded that the target-side context is less useful than the source-side context. However, we consider that the reason the target-side context appears less useful lies in the architecture used to model it. Therefore, in this study, we investigate how the target-side context can improve context-aware neural machine translation. We propose a weight sharing method in which the NMT model saves its decoder states and calculates an attention vector over the saved states when translating the current sentence. Our experiments show that the target-side context is also useful if it is fed into the model as the decoder states saved while translating the previous sentence.


1 Introduction

Figure 1: Proposed methods: (a) separated source model, (b) separated target model, (c) shared source model, (d) shared target model. Dashed lines represent weight sharing with the encoders or decoders.

Neural machine translation (NMT; Sutskever et al. [1], Bahdanau et al. [2], Vaswani et al. [3]) has become popular in recent years because it can handle larger contexts than conventional machine translation systems. However, like those conventional systems, most NMT models do not employ document-level context because they lack an efficient mechanism for incorporating it.

Recently, a few studies have attempted to expand the notion of a sentence-level context in NMT to that of a document-level context (hereinafter, "document-level context" is simply referred to as a "context"). It has been reported that the information of one or more previous sentences improves both automatic and human evaluation scores.

Context-aware NMT systems typically have two encoders: one for the current sentence and the other for a previous sentence. For instance, Bawden et al. [4] showed that encoding the previous target sentence does not improve performance in an English–French task, even though encoding the previous source sentence works well. Other studies that used a multi-encoder (Jean et al. [5], Voita et al. [6], Zhang et al. [7]) did not use the previous target sentence at all. Thus, only a few works handle the target-side context. Moreover, these previous works mainly used language pairs that belong to the same language family. For distant language pairs, information about the discourse structure of the target-side document might be useful, because distant languages might have different discourse structures.

Therefore, this study investigates how the target-side context can be used in context-aware NMT. We hypothesize that source-side contexts should be incorporated into an encoder and target-side contexts into a decoder. To validate this hypothesis, we propose a weight sharing method in which the NMT model saves its decoder states and calculates an attention vector over the saved states when translating the current sentence. We find that target-side contexts are indeed useful if they enter the model as decoder states. This method obtains competitive or even better results compared to a baseline model using source-side features.

The main findings of this study are as follows:

  • The target-side context is as important as the source-side context.

  • The effectiveness of the source-side context depends on the language pair.

  • Weight sharing between current and context states is effective for context-aware NMT.

2 Model Architecture

Figure 1 presents our methods. We build context-aware NMT based on the multi-encoder model proposed by Bawden et al. [4]. A parallel document consisting of $N$ sentence pairs is denoted by $D = \{(X^1, Y^1), \ldots, (X^N, Y^N)\}$, where $X^n$ and $Y^n$ are source and target sentences, respectively. Each sentence $X^n$ or $Y^n$ is denoted as $(x^n_1, \ldots, x^n_I)$ or $(y^n_1, \ldots, y^n_J)$, where $x^n_i$ or $y^n_j$ are the tokens, and $I$ or $J$ are the sentence lengths. The objective is to maximize the following probabilities:

$P(Y^n \mid X^n, C^n) = \prod_{j=1}^{J} P(y^n_j \mid y^n_{<j}, X^n, C^n) \qquad (1)$

where $C^n$ represents a previous sentence, $X^{n-1}$ or $Y^{n-1}$, depending on the experimental settings. Each $P(y^n_j \mid y^n_{<j}, X^n, C^n)$ is calculated as follows (the document index $n$ is omitted below):

$P(y_j \mid y_{<j}, X, C) = \mathrm{softmax}(W_s \tilde{d}_j) \qquad (2)$
$\tilde{d}_j = \tanh(W_c [d_j ; a_j]) \qquad (3)$
$a_j = \sum_i \alpha_{ij} h_i \qquad (4)$
$\alpha_{ij} = \mathrm{softmax}_i(h_i^\top d_j) \qquad (5)$

where $h_i$, $d_j$, and $a_j$ represent encoder states, decoder states, and attention, respectively. $W_s$ and $W_c$ represent weights. We calculate the encoder state and the decoder state as follows:

$h_i = \mathrm{LSTM_{enc}}(E_x[x_i], h_{i-1}) \qquad (6)$
$d_j = \mathrm{LSTM_{dec}}(E_y[y_{j-1}], d_{j-1}) \qquad (7)$

where $E_x$ and $E_y$ represent the word embeddings of the source and target sides, respectively. We use the dot product of encoder states and decoder hidden states as the attention score $\alpha_{ij}$, as proposed by Luong et al. [8].

The multi-encoder model has an additional attention, $a^c_j$, which uses the information of a previous sentence:

$a^c_j = \sum_k \beta_{kj} s_k, \quad \beta_{kj} = \mathrm{softmax}_k(s_k^\top d_j) \qquad (8)$
$\tilde{d}_j = \tanh(W_c [d_j ; a_j ; a^c_j]) \qquad (9)$

where $s_k$ are the context states over which the context attention is computed.
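As a concrete illustration, the following is a minimal sketch of Equations (2)–(9), assuming a PyTorch implementation with dot attention. The tensor shapes and the weight names W_c and W_s are illustrative reconstructions, not the authors' code.

```python
import torch
import torch.nn.functional as F


def dot_attention(query, keys):
    """query: (batch, dim); keys: (batch, length, dim) -> weighted sum (batch, dim)."""
    scores = torch.bmm(keys, query.unsqueeze(2)).squeeze(2)  # dot scores (Eqs. 5 / 8)
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights.unsqueeze(1), keys).squeeze(1)


def decode_step(d_j, enc_states, ctx_states, W_c, W_s):
    """One target position j: P(y_j | ...) with source and context attention."""
    a_j = dot_attention(d_j, enc_states)      # attention over the current source (Eqs. 4-5)
    a_ctx = dot_attention(d_j, ctx_states)    # context attention (Eq. 8)
    d_tilde = torch.tanh(W_c(torch.cat([d_j, a_j, a_ctx], dim=-1)))  # Eq. (9)
    return F.softmax(W_s(d_tilde), dim=-1)    # Eq. (2)


# Toy usage with random states (hidden size 512 as in Section 3.2).
batch, src_len, ctx_len, dim, vocab = 2, 7, 6, 512, 32000
W_c = torch.nn.Linear(3 * dim, dim, bias=False)
W_s = torch.nn.Linear(dim, vocab, bias=False)
p = decode_step(torch.randn(batch, dim), torch.randn(batch, src_len, dim),
                torch.randn(batch, ctx_len, dim), W_c, W_s)
print(p.shape)  # torch.Size([2, 32000])
```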

We experiment with two methods: the separated model and the shared model. The separated model is the conventional multi-encoder model, and the shared model is our proposed method. The difference between the two lies in how the context states $s_k$ in Equation (8) are obtained.

Figure 2: Shared mix model.

2.1 Separated model

Previous works on multi-encoder models have an additional encoder, referred to as a context encoder, which encodes a previous source or target sentence and whose states are attended to when translating the current sentence. The context encoder reads the previous source-side sentence $X^{n-1}$ or target-side sentence $Y^{n-1}$ as context:

$s^{src}_k = \mathrm{LSTM_{ctx}}(E_x[x^{n-1}_k], s^{src}_{k-1}) \qquad (10)$
$s^{tgt}_k = \mathrm{LSTM_{ctx}}(E_y[y^{n-1}_k], s^{tgt}_{k-1}) \qquad (11)$

We refer to this architecture as a separated model in this paper. In the separated model, the weights of the context encoder are different from those of the current encoder, which encodes the current source sentence. If $s^{src}$ is used as $s$ in Equation (8), we call this model the separated source model; if $s^{tgt}$ is used, we call it the separated target model.
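A minimal sketch of the separated model, continuing the PyTorch setup above: the context encoder is a second LSTM whose weights are not shared with the current encoder. Making the context encoder bidirectional is our assumption for the sketch.

```python
import torch
import torch.nn as nn

dim = 512
# Current encoder and a separate context encoder with its own weights.
current_encoder = nn.LSTM(dim, dim // 2, num_layers=2, bidirectional=True, batch_first=True)
context_encoder = nn.LSTM(dim, dim // 2, num_layers=2, bidirectional=True, batch_first=True)

curr_src_emb = torch.randn(2, 7, dim)   # embeddings of the current source sentence X^n
prev_emb = torch.randn(2, 9, dim)       # embeddings of X^{n-1} (separated source)
                                        # or Y^{n-1} (separated target)

h, _ = current_encoder(curr_src_emb)    # current encoder states (Eq. 6)
s, _ = context_encoder(prev_emb)        # context states used in Eq. (8)
```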

2.2 Shared model

A shared model saves the hidden states of the encoder or decoder and then calculates $a^c_j$ over these states when translating the current sentence. The strength of this model is that the target-side context can be incorporated through the decoder instead of an encoder. Moreover, the shared model requires neither many additional parameters nor much extra computation time, because it simply loads the saved hidden states. Thus, these models can be seen as instances of weight sharing between the current encoder or decoder and a context encoder. The shared source model uses the saved encoder states $h$ of the previous sentence as $s$, and the shared target model uses the saved decoder states $d$ of the previous sentence as $s$.
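The following minimal sketch contrasts with the separated model above: no extra encoder is introduced; instead, the hidden states already computed for the previous sentence by the current encoder (shared source) or by the decoder (shared target) are cached and reused as the context states. The cache handling and the use of detach() are our assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

dim = 512
encoder = nn.LSTM(dim, dim // 2, num_layers=2, bidirectional=True, batch_first=True)
decoder = nn.LSTM(dim, dim, num_layers=2, batch_first=True)

saved = {}  # states cached while translating sentence n-1

# --- while translating sentence n-1 ---
h_prev, _ = encoder(torch.randn(2, 9, dim))   # encoder states of X^{n-1}
d_prev, _ = decoder(torch.randn(2, 8, dim))   # decoder states of Y^{n-1}
saved["h"], saved["d"] = h_prev.detach(), d_prev.detach()

# --- while translating sentence n ---
h_curr, _ = encoder(torch.randn(2, 7, dim))   # current encoder states
s = saved["h"]  # shared source model: context states = saved encoder states
s = saved["d"]  # shared target model: context states = saved decoder states
```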

2.3 Shared mix model

We also propose a shared mix model, which incorporates both the source- and target-side contexts. Figure 2 presents the shared mix model. Its context attention is calculated as $a^c_j = a^{src}_j + a^{tgt}_j$, where $a^{src}_j$ and $a^{tgt}_j$ are the context attentions calculated by Equation (8) over the saved encoder and decoder states, respectively. We sum the two attentions so that the model has the same number of parameters as the other shared models; the remaining architecture is unchanged.
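A short sketch of the shared mix model, reusing dot_attention and the saved cache from the sketches above: the two context attentions of Equation (8) are computed over the saved encoder and decoder states and summed, so the parameter count matches the other shared models.

```python
import torch

d_j = torch.randn(2, 512)                 # a current decoder state d_j
a_src = dot_attention(d_j, saved["h"])    # context attention over saved encoder states
a_tgt = dot_attention(d_j, saved["d"])    # context attention over saved decoder states
a_ctx = a_src + a_tgt                     # a^c_j = a^src_j + a^tgt_j, then fed into Eq. (9)
```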

3 Experiments

3.1 Data

Corpus Train Dev Test
TED De–En 203,998 888 1,305
TED Zh–En 226,196 879 1,297
TED Ja–En 194,170 871 1,285
Recipe Ja–En 108,990 3,303 2,804
Table 1: Number of sentences in each dataset.
Experiment Baseline Separated Shared
Source Target Source Target Mix
TED De–En 26.55
TED En–De 21.26
TED Zh–En 12.54
TED En–Zh 8.97
TED Ja–En 5.84
TED En–Ja 8.40
Recipe Ja–En 25.34
Recipe En–Ja 20.81
Table 2: BLEU scores of our context-aware NMT for each language pair. Each score is the average of three runs. Results that are statistically significant against the baseline in all runs are marked.

We mainly use the IWSLT 2017 German–English, Chinese–English, and Japanese–English datasets from TED [9] for the experiments. We consider each TED talk as a document, which includes sentences that cannot be translated using only sentence-level information. Japanese and Chinese sentences are segmented by MeCab (http://taku910.github.io/mecab/, dictionary: IPADic 2.7.0) and jieba (https://github.com/fxsjy/jieba), respectively. English and German sentences are tokenized by tokenizer.perl included in Moses (http://www.statmt.org/moses/). Documents that include sentences consisting of more than 100 words are eliminated from the training corpus. We evaluate our methods on the 2014 test set. The statistics of the preprocessed corpora are shown in Table 1. Byte pair encoding [10] is applied separately to the source and target languages for subword segmentation, with 32,000 merge operations.
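As an illustration of the filtering step described above, the following sketch drops any training document that contains a sentence longer than 100 tokens; applying the check to both the source and target sides is our assumption.

```python
def filter_documents(documents, max_len=100):
    """documents: list of documents, each a list of (src_tokens, tgt_tokens) pairs."""
    kept = []
    for doc in documents:
        if all(len(src) <= max_len and len(tgt) <= max_len for src, tgt in doc):
            kept.append(doc)
    return kept


# The second toy document is removed because one of its sentences exceeds the limit.
docs = [
    [(["a"] * 10, ["b"] * 12)],
    [(["a"] * 10, ["b"] * 12), (["a"] * 101, ["b"] * 30)],
]
print(len(filter_documents(docs)))  # 1
```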

Moreover, we use the Recipe Corpus (http://lotus.kuee.kyoto-u.ac.jp/WAT/recipe-corpus/), which consists of Japanese–English user-posted recipes, to investigate the influence of a different domain. The data preprocessing procedure is the same as for the TED corpus, except for the number of merge operations (8,000).

3.2 Settings

The baseline system in this experiment is our own implementation of RNN-based NMT. The encoder is a two-layer bi-LSTM, and the decoder is a two-layer uni-LSTM. The dimensions of the hidden states and embeddings are set to 512. We apply dropout, and the optimizer is AdaGrad. Each batch consists of up to 128 documents. These settings are the same for the baseline and all context-aware models. Dot global attention is used for calculating the context attention $a^c_j$. The first sentence of each document has no previous sentence, so its context must be set separately.
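For concreteness, here is a minimal PyTorch sketch of the baseline architecture described above; the dropout rate, the initial learning rate, and the vocabulary size are placeholders, since they are not specified here.

```python
import torch.nn as nn
import torch.optim as optim

VOCAB, DIM = 32000, 512  # placeholder vocabulary size; 512-dim embeddings and states

class BaselineNMT(nn.Module):
    def __init__(self):
        super().__init__()
        self.src_emb = nn.Embedding(VOCAB, DIM)
        self.tgt_emb = nn.Embedding(VOCAB, DIM)
        # 2-layer bi-LSTM encoder and 2-layer uni-LSTM decoder.
        self.encoder = nn.LSTM(DIM, DIM // 2, num_layers=2, bidirectional=True,
                               batch_first=True, dropout=0.2)  # placeholder dropout rate
        self.decoder = nn.LSTM(DIM, DIM, num_layers=2, batch_first=True,
                               dropout=0.2)                    # placeholder dropout rate
        self.out = nn.Linear(DIM, VOCAB)

model = BaselineNMT()
optimizer = optim.Adagrad(model.parameters(), lr=0.01)  # placeholder initial learning rate
```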

The context-aware models are pretrained with the baseline system. Each model is trained for 30 epochs, and the best model is selected on the development set. The results are evaluated using BLEU [11]. We calculate the statistical significance between the baseline and our methods with the bootstrap resampling toolkit in Travatar [12]. Experiments are performed three times with different random seeds.
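As an illustration of the significance test (the paper uses the Travatar toolkit; this sketch uses sacrebleu instead), paired bootstrap resampling can be approximated as follows: resample sentence indices with replacement and count how often the context-aware system beats the baseline on corpus BLEU.

```python
import random
import sacrebleu


def paired_bootstrap(baseline_hyps, system_hyps, refs, n_samples=1000, seed=0):
    """Approximate p-value for 'the system beats the baseline' on corpus BLEU."""
    rng = random.Random(seed)
    indices = list(range(len(refs)))
    wins = 0
    for _ in range(n_samples):
        sample = [rng.choice(indices) for _ in indices]   # resample with replacement
        base = sacrebleu.corpus_bleu([baseline_hyps[i] for i in sample],
                                     [[refs[i] for i in sample]]).score
        syst = sacrebleu.corpus_bleu([system_hyps[i] for i in sample],
                                     [[refs[i] for i in sample]]).score
        if syst > base:
            wins += 1
    return 1.0 - wins / n_samples
```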

3.3 Results

Table 2 shows the results. The shared target model improves the performance in all language pairs. For several language pairs, the separated target model used by Bawden et al. [4] also improves performance over the baseline, but the improvement is smaller than that of the shared target model. These results therefore show that the target-side context should be introduced through the decoder.

4 Discussion

Experiment Sentences
Input わかめ は よく 洗っ て 塩 を 落とし 、 10 分 ほど 水 に 浸け て おい て から ざく 切り に する 。 長 ねぎ は 小口切り に する 。
熱し た 鍋 に ごま油 を ひき 、 わかめ と 長 ねぎ を 入れ て 30 秒 ほど 軽く 炒める 。

Reference
wash the wakame well to remove the salt , put into a bowl of water for 10 minutes and drain . cut into large pieces . slice the Japanese leek .
heat a pan and pour the sesame oil . stir fry the wakame and leek for 30 seconds .

Baseline
wash the wakame seaweed well and remove the salt . soak in water for 10 minutes , then roughly chop . cut the Japanese leek into small pieces .
heat sesame oil in a heated pot , add the wakame and leek , and lightly sauté for about 30 seconds .

Separated Source
wash the wakame well , remove the salt , soak in water for about 10 minutes , then roughly chop . cut the Japanese leek into small pieces .
heat sesame oil in a heated pot and add the wakame and leek . stir-fry for about 30 seconds .

Shared Source
wash the wakame well, remove the salt , soak in water for about 10 minutes , then roughly chop . cut the Japanese leek into small pieces .
heat sesame oil in a heated pot and add the wakame and leek . stir-fry for about 30 seconds .

Separated Target
wash the wakame well , soak in water for about 10 minutes . cut into small pieces. cut the Japanese leek into small pieces .
heat the sesame oil in a frying pan , add the wakame and leek , and stir-fry for about 30 seconds .

Shared Target
wash the wakame well , remove the salt , soak in water for about 10 minutes , then roughly chop . chop the Japanese leek into small pieces .
heat sesame oil in a heated pan , add the wakame and Japanese leek , and lightly stir-fry for about 30 seconds .

Shared Mix
wash the wakame well , remove the salt , soak in water for about 10 minutes , then roughly chop . chop the Japanese leek into small pieces .
heat sesame oil in a heated pan , add the wakame and Japanese leek , and stir-fry for about 30 seconds .

Table 3: Output examples from the Recipe Japanese–English experiments. Each upper sentence is a previous sentence, and each lower sentence is a current sentence. Each sequence may comprise several sentences because each sentence in the Recipe corpus corresponds to one step of cooking.
Figure 3: BLEU scores on each development set, calculated at the end of each epoch, for (a) TED German–English, (b) TED English–German, (c) TED Chinese–English, (d) TED English–Chinese, (e) TED Japanese–English, (f) TED English–Japanese, (g) Recipe Japanese–English, and (h) Recipe English–Japanese.

4.1 Weight sharing

We expected that there would be no difference between the results of the shared source and separated source models because both introduce source-side context into an encoder. However, for the language pairs used in this study, the shared source model also improves the BLEU scores while using fewer parameters. Dabre et al. [13] found that translation performance can be boosted even when the weights of stacked layers are shared. Our shared models can be seen as an instance of weight sharing that stacks sentence-level RNNs in chronological order. The shared models can also be seen as an instance of multitask learning that shares the same weights for the encoder–decoders of neighboring sentences, as in skip-thought [14]. Thus, it is possible that weight sharing leads to a more efficient model space through regularization, rather than through learning discourse structures.

4.2 Language dependency

The tendencies of the scores vary depending on the language pair. The result of the TED English–German task shows that the source-side context decreases performance. Müller et al. [15] obtained similar results on other datasets using the concatenation method proposed by Tiedemann et al. [16]. However, in the Japanese–English and English–Japanese tasks, the source-side context is as important as the target-side one. The reason is that Japanese requires context more than English: Japanese is a pro-drop language, which allows the omission of agents and object arguments when they are pragmatically or syntactically inferable. Comparing the results on the TED and Recipe corpora, the difference of corpus domain does not affect these tendencies. In the Chinese–English task, where the two languages have more similar word order, the target-side context is as important as, or even more important than, the source-side one. Therefore, these results imply that the necessity of the source-side context depends on the language pair, whereas the target-side context is generally important.

The shared mix model obtains results competitive with the shared source model in most of the language pairs. Therefore, either context alone is enough for improvement, as long as we choose the source- or target-side context depending on the language pair.

4.3 Output examples

We analyze the output examples in terms of phrase coherence. We select the Recipe Japanese–English task because Japanese is a pro-drop language with many omissions that require context, and because it is difficult to draw definitive conclusions from the TED Japanese–English task, whose BLEU score is too low to analyze. Table 3 shows the examples. When the model translates the previous sentence, it does not use context information because that sentence is the first sentence of a document. The examples in each lower row are the results obtained using the upper sentence as context.

Looking at the baseline result, “長 ねぎ” (naga negi, Japanese leek) is translated as “Japanese leek” in the previous sentence but as “leek” in the current sentence. This phenomenon is also commonly seen in the results of the separated models. If we evaluated these sentences independently, they would be rated as highly fluent and adequate, and their BLEU scores are also high because the reference follows the same translation. However, these sentences have low coherence because the same noun phrase in the Japanese source is translated into different phrases.

In contrast, “長 ねぎ” is translated as “Japanese leek” in both sentences by the shared target model and the shared mix model, which use the target-side context. Our models thus improve phrase coherence through weight sharing.
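The coherence check used in this analysis can be made explicit with a small sketch (the candidate list and example outputs are illustrative): for a source phrase appearing in both the previous and current sentences, we test whether its translation is identical in both outputs.

```python
def consistent_translation(prev_out, curr_out, candidates):
    """candidates: possible translations of one source phrase, e.g. '長ねぎ', longest first."""
    prev_hit = next((c for c in candidates if c in prev_out), None)
    curr_hit = next((c for c in candidates if c in curr_out), None)
    return prev_hit is not None and prev_hit == curr_hit


candidates = ["Japanese leek", "leek"]  # check the longer phrase first
print(consistent_translation("cut the Japanese leek into small pieces .",
                             "add the wakame and leek", candidates))            # False (baseline-style)
print(consistent_translation("chop the Japanese leek into small pieces .",
                             "add the wakame and Japanese leek", candidates))   # True (shared target)
```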

4.4 Convergence of training

Figure 3 plots the BLEU scores on the development sets. The shared models and the separated source model are stable. However, the separated target model is unstable and does not lead to an improvement. We attribute this to the exposure bias problem [17], which affects the context encoder as well as the decoder: at test time, the separated target model has to read a low-quality generated sentence with a well-trained context encoder if the encoder learns faster than the decoder. Thus, the separated target model needs to bridge the gap between the learning speeds of the context encoder and the decoder.

5 Related Works

Wang et al. [18], Maruf et al. [19], and Tu et al. [20] incorporated the information of previous sentences by using a hierarchical encoder, a memory network, and a cache mechanism, respectively. Although they used several sentences as context, the former two works found that the information of distant sentences in a document does not improve translation quality. Our investigation therefore focuses on the immediately previous sentence.

Tiedemann et al. [16] concatenated a previous sentence with the current sentence as the input or output of conventional NMT to incorporate source-side or target-side context. Müller et al. [15] evaluated existing context-aware NMT systems on the English–German task in terms of pronoun translation. They concluded that generating a concatenated sentence is more effective than inputting one. The results of our shared target model support this conclusion.

Voita et al. [6] and Zhang et al. [7] proposed Transformer-based context-aware NMT. The former suggested that self-attention helps anaphora resolution. The latter fine-tuned a single-sentence NMT model trained on large data with small document-level data to adapt it to context-aware NMT. However, neither investigated the influence of the target-side context.

6 Conclusion

We reported how context-aware neural machine translation can effectively employ target-side contexts. We proposed a weight sharing method that models the target-side context in the decoder. This method achieves high performance in several language pairs, even though it requires few additional parameters. In future work, we will analyze whether this model can handle longer contexts.

References

  • [1] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in Neural Information Processing Systems 27, 2014, pp. 3104–3112.
  • [2] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in Proceedings of the International Conference on Learning Representations, 2015.
  • [3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems 30, 2017, pp. 5998–6008.
  • [4] R. Bawden, R. Sennrich, A. Birch, and B. Haddow, “Evaluating discourse phenomena in neural machine translation,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, pp. 1304–1313.
  • [5] S. Jean, S. Lauly, O. Firat, and K. Cho, “Does neural machine translation benefit from larger context?” CoRR, vol. abs/1704.05135, 2017.
  • [6] E. Voita, P. Serdyukov, R. Sennrich, and I. Titov, “Context-aware neural machine translation learns anaphora resolution,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 1264–1274.
  • [7] J. Zhang, H. Luan, M. Sun, F. Zhai, J. Xu, M. Zhang, and Y. Liu, “Improving the Transformer translation model with document-level context,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 533–542.
  • [8] T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1412–1421.
  • [9] M. Cettolo, C. Girardi, and M. Federico, “WIT3: Web inventory of transcribed and translated talks,” in Proceedings of the 16th Conference of the European Association for Machine Translation, May 2012, pp. 261–268.
  • [10] R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation of rare words with subword units,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 1715–1725.
  • [11] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: a method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, 2002, pp. 311–318.
  • [12] G. Neubig, “Travatar: A forest-to-string machine translation engine based on tree transducers,” in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2013, pp. 91–96.
  • [13] R. Dabre and A. Fujita, “Recurrent stacking of layers for compact neural machine translation models,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 6292–6299.
  • [14] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler, “Skip-thought vectors,” in Advances in Neural Information Processing Systems 28, 2015, pp. 3294–3302.
  • [15] M. Müller, A. Rios, E. Voita, and R. Sennrich, “A large-scale test set for the evaluation of context-aware pronoun translation in neural machine translation,” in Proceedings of the Third Conference on Machine Translation: Research Papers, 2018, pp. 61–72.
  • [16] J. Tiedemann and Y. Scherrer, “Neural machine translation with extended context,” in Proceedings of the Third Workshop on Discourse in Machine Translation, 2017, pp. 82–92.
  • [17] M. Ranzato, S. Chopra, M. Auli, and W. Zaremba, “Sequence level training with recurrent neural networks,” in Proceedings of the International Conference on Learning Representations, 2016.
  • [18] L. Wang, Z. Tu, A. Way, and Q. Liu, “Exploiting cross-sentence context for neural machine translation,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 2826–2831.
  • [19] S. Maruf and G. Haffari, “Document context neural machine translation with memory networks,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 1275–1284.
  • [20] Z. Tu, Y. Liu, S. Shi, and T. Zhang, “Learning to remember translation history with a continuous cache,” Transactions of the Association for Computational Linguistics, vol. 6, pp. 407–420, 2018.