Neural Machine Translation (NMT), established on the encoder-decoder framework, where the encoder takes a source sentence as input and encodes it into a fixed-length embedding vector and the decoder generates the translated sentence according to that embedding, has achieved advanced translation performance in recent years. So far, most models adopt the standard assumption that every sentence is translated independently, ignoring document-level contextual clues during translation. However, document-level information can improve translation performance in multiple respects: consistency, disambiguation, and coherence Kuang et al. (2018). If every sentence is translated independently of document-level context, it is difficult to keep the sentence translations across the entire text consistent with each other. Moreover, document-level context can also assist the model in disambiguating words with multiple senses. Finally, the global context helps produce coherent translations.
There have been a few recent attempts to introduce document-level information into existing standard NMT models. Jean et al. (2017) model the context from the surrounding text in addition to the source sentence, and Tiedemann and Scherrer (2017) extend the source sentence and translation units with contextual segments to improve the translation. Wang et al. (2017) use a hierarchical Recurrent Neural Network (RNN) to incorporate information from previous sentences. Miculicich et al. (2018) propose a multi-head hierarchical attention machine translation model to capture word-level and sentence-level information. The cache-based model of Kuang et al. (2018) uses a dynamic cache and a topic cache to capture connections to neighboring sentences. In addition, Wang et al. (2017), Kuang and Xiong (2018), and Voita et al. (2018) all add contextual information to the NMT model by applying the gating mechanism proposed by Tu et al. (2017) to dynamically control the auxiliary global context at each decoding step. However, most existing document-level NMT methods have to inconveniently prepare the contextual input or model the global context in advance.
Inspired by the observation that humans, like document-level machine translation models, constantly refer to the context of the source sentence during translation, as if querying their memory, we propose to utilize the document-level sentences associated with the source sentence to help predict the target sentence. To reach this goal, we adopt a Memory Network component Weston et al. (2015); Sukhbaatar et al. (2015); Guan et al. (2019), which provides a natural solution for modeling document-level context in document-level NMT. In fact, Maruf and Haffari (2018) have already presented a document-level NMT model that uses memory networks to project the document context into a tiny dense hidden state of an RNN model, updated word by word, and their model is effective in exploiting both source and target document context.
Differing from any previous work, this paper presents a Transformer NMT model with a document-level Memory Network enhancement Weston et al. (2015); Sukhbaatar et al. (2015) that incorporates contextual clues into the encoding of the source sentence through the Memory Network. Unlike the work of Maruf and Haffari (2018), which memorizes the whole document in a tiny dense hidden state, the memory in our work computes, via an attention mechanism, the document-level contextualized information associated with the current source sentence. In this way, our proposed model is able to focus on the part of the memory most relevant to the concerned translation, which exactly encodes the concerned document-level context.
The empirical results indicate that our proposed method significantly improves the BLEU score compared with a strong Transformer baseline and performs better than other related document-level machine translation models on multiple language pairs across multiple domains.
2 Related Work
The existing work about NMT on document-level can be divided into two parts: one is how to obtain the document-level information in NMT, and the other is how to integrate the document-level information.
2.1 Mining Document-level Information
Tiedemann and Scherrer (2017) propose to simply extend the context during the NMT model training in different ways: (1) extending the source sentence which includes the context from the previous sentences in the source language, and (2) extending translation units which increase the segments to be translated.
Wang et al. (2017) propose a cross-sentence context-aware RNN approach to produce a global context representation called Document RNN. Given a source sentence in the document to be translated and its previous sentences, they can obtain all sentence-level representations after processing each sentence. The last hidden state represents the summary of the whole sentence as it stores order-sensitive information. Then the summary of the global context is represented by the last hidden state over the sequence of the above sentence-level representations.
Specific Vocabulary Bias
2.2 Integrating Document-level Information
Adding Auxiliary Context
Gating Auxiliary Context
Tu et al. (2017) introduce a context gate to automatically control the contributions of the source and context representations to the generation of target words. Wang et al. (2017) also introduce this mechanism in their work to dynamically control the information flowing from the global text at each decoding step.
Inter-sentence Gate Model
Kuang and Xiong (2018) propose an inter-sentence gate model, which is based on the attention-based NMT and uses the same encoder to encode two adjacent sentences and controls the amount of information flowing from the preceding sentence to the translation of the current sentence with an inter-sentence gate. This gate framework assigns element-wise weights to the input signals which are calculated by the context vectors of two adjacent sentences, target word representation and the decoder hidden state.
Cache-based Neural Model
Tu et al. (2018) propose to augment NMT models with an external cache to exploit translation history. At each decoding step, the probability distribution over generated words is updated online depending on the translation history retrieved from the cache with a query of the current attention vector, which assists NMT models to dynamically adapt over time. The cache-based neural model proposed by Kuang et al. (2018) consists of two components: a topic cache and a dynamic cache. When the decoder shifts to a new test document, the topic cache is emptied and filled with target topical words for the new test document. The dynamic cache is continuously expanded with newly generated target words from the best translation hypotheses of previous sentences. The final prediction probability for the target word is calculated by a gate mechanism that combines the prediction probability from the cache-based neural model and that from the original NMT decoder.
Hierarchical Attention Networks
Voita et al. (2018) introduce the context information into the Transformer Vaswani et al. (2017) and leave the Transformer’s decoder intact while processing the context information on the encoder side. The model calculates the gate from the source sentence attention and the context sentence attention, then exploits their gated sum as the encoder output. Zhang et al. (2018) also extend the Transformer with a new context encoder to represent document-level context while incorporating it into both the original encoder and decoder by multi-head attention.
3.1 Neural Machine Translation
Given a source sentence $\mathbf{x} = (x_1, \dots, x_n)$ in the document to be translated and a target sentence $\mathbf{y} = (y_1, \dots, y_m)$, an NMT model computes the probability of translating the source sentence into the target sentence word by word:

$$P(\mathbf{y} \mid \mathbf{x}) = \prod_{j=1}^{m} P(y_j \mid \mathbf{y}_{<j}, \mathbf{x}),$$

where $\mathbf{y}_{<j}$ is a substring containing the words $y_1, \dots, y_{j-1}$. Generally, with an RNN, the probability of generating the $j$-th word is modeled as

$$P(y_j \mid \mathbf{y}_{<j}, \mathbf{x}) = g(y_{j-1}, s_j, c_j),$$

where $g$ is a nonlinear function that outputs the probability of $y_j$ given the previously generated word $y_{j-1}$, and $c_j$ is the $j$-th source representation. The $j$-th decoding hidden state $s_j$ is computed as

$$s_j = f(s_{j-1}, y_{j-1}, c_j).$$
For NMT models with an encoder-decoder framework, the encoder maps an input sequence of symbol representations $(x_1, \dots, x_n)$ to a sequence of continuous representations $\mathbf{z} = (z_1, \dots, z_n)$. Then, the decoder generates the corresponding target sequence of symbols one element at a time.
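The one-element-at-a-time generation above can be sketched as a simple greedy decoding loop. This is an illustrative sketch, not the paper's implementation; `step_fn` is an assumed stand-in for any decoder that returns a vocabulary distribution given the prefix:

```python
import numpy as np

def greedy_decode(step_fn, bos_id, eos_id, max_len=50):
    """Generate target ids one at a time: at each step the decoder
    (step_fn, a stand-in for the real model) returns a probability
    distribution P(y_t | y_<t, x) over the vocabulary, and we
    greedily pick the argmax until the end-of-sentence symbol."""
    ys = [bos_id]
    for _ in range(max_len):
        probs = step_fn(ys)        # distribution over vocabulary
        y = int(np.argmax(probs))  # greedy choice
        ys.append(y)
        if y == eos_id:
            break
    return ys
```

Beam search would keep several prefixes instead of one, but the factorization it scores is the same product of conditionals.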
3.2 Transformer Architecture
Based solely on the attention mechanism, Vaswani et al. (2017) propose a network architecture called the Transformer for NMT, which uses stacked self-attention and point-wise, fully connected layers for both the encoder and decoder.
As illustrated in Figure 1 (a), the encoder is composed of a stack of $N$ (usually $N = 6$) identical layers, and each layer has two sub-layers: (1) a multi-head self-attention mechanism, and (2) a simple, position-wise fully connected feed-forward network.
Multi-head attention, illustrated in Figure 1 (b), allows the Transformer to jointly process information from different representation subspaces at different positions. It linearly projects the queries $Q$, keys $K$, and values $V$ $h$ times with different, learned linear projections to $d_k$, $d_k$, and $d_v$ dimensions respectively; the attention function is then performed in parallel, generating $d_v$-dimensional output values, which are concatenated and once again projected to yield the final result. The core of multi-head attention is Scaled Dot-Product Attention, calculated as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V.$$
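A minimal NumPy sketch of a single attention head may make the formula concrete (function names and shapes are illustrative, not from the paper's implementation; masking and the per-head projections are omitted):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
    Q: (len_q, d_k), K: (len_k, d_k), V: (len_k, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (len_q, len_k)
    weights = softmax(scores, axis=-1)   # each query row sums to 1
    return weights @ V                   # (len_q, d_v)
```

Multi-head attention runs this function $h$ times on learned projections of $Q$, $K$, $V$ and concatenates the $h$ outputs before the final projection.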
Similar to the encoder, the decoder is also composed of a stack of $N$ identical layers, but it inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. The Transformer also employs residual connections around each of the sub-layers, followed by layer normalization. Thus, the Transformer is more parallelizable and faster at translating than earlier RNN methods.
3.3 Memory Network
Memory networks Weston et al. (2015) utilize external memories as inference components for long-range dependencies, and can be categorized as a sort of lazy machine learning Aha (2013). Using a similar memorizing mechanism, memory-based learning methods have also been applied in multiple traditional models Daelemans (1999); Fix and Hodges Jr (1951); Skousen (1989, 2013); Lebowitz (1983); Nivre et al. (2004). A memory network introduced by Weston et al. (2015) is a set of vectors $M = \{m_1, \dots, m_n\}$, where each memory cell $m_i$ is potentially relevant to a discrete object (for example, a word) $x_i$. The memory is equipped with a read and optionally a write operation. Given a query vector $q$, the output vector produced by reading from the memory is $o = \sum_i K(q, m_i)\, m_i$, where $K(q, m_i)$ scores the match between the query vector $q$ and the $i$-th memory cell $m_i$.
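The read operation above can be sketched in a few lines of NumPy. This is an illustrative sketch under the assumption that the match score $K(q, m_i)$ is a softmax over dot products, one common choice; the original formulation leaves the scoring function open:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def memory_read(q, M):
    """Read from memory M (n cells x d dims) with query q (d,):
    o = sum_i p_i * m_i, where p_i = softmax_i(q . m_i).
    Dot-product scoring is an assumption standing in for K(q, m_i)."""
    p = softmax(M @ q)   # match score between q and each memory cell
    return p @ M         # weighted sum of memory cells
```

With a sharply matching query, the output is dominated by the single most relevant cell, which is exactly the addressing behavior the CAMN relies on.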
Our NMT model consists of two components: a Contextual Associated Memory Network (CAMN) and a Transformer model. For the CAMN component, the core part is a neural controller, which acts as a “processor” that reads memory from the contextual storage “RAM” according to the input before sending this memory to other components. The controller calculates the correlation between the input and the memory data, i.e., performs “memory addressing”.
Our model requires two encoders: the context encoder for the CAMN and the source encoder that builds the input sentence representation for translation.
Inspired by Wang et al. (2017), which introduces a Document RNN to summarize cross-sentence context, we run an RNN over each context sentence to generate the context representation; the hidden state at each time step represents the information from the first word up to the current word. The source encoder is composed of a stack of $N$ layers, the same as the source encoder in the original Transformer Vaswani et al. (2017).
4.3 Contextual Associated Memory Network
The proposed contextual associated memory network consists of three parts, context selection, inter-sentence attention, and context gating.
We aim to utilize the context sentences and their representations to assist our model in predicting the target sentences. In principle, we could treat all the sentences in the document as context. However, it is impossible to attend to all the sentences in the training dataset because of the extremely high computational and memory cost. Following Voita et al. (2018), whose model performs best when using a context encoder for the previous sentence, we use the previous sentence of the source sentence as the context sentence. Then, at each training step, we gather the context sentences of all the source sentences in a batch as the context sentences.
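The pairing of each source sentence with its previous sentence can be sketched as follows. This is an illustrative helper (the function name and the empty-context convention for the first sentence of a document are assumptions, not from the paper):

```python
def with_previous_context(doc_sentences):
    """Pair each source sentence (a list of tokens) with its
    previous sentence as context; the first sentence of a
    document gets an empty context."""
    pairs = []
    prev = []  # no context before the first sentence
    for sent in doc_sentences:
        pairs.append((sent, prev))
        prev = sent
    return pairs
```

At training time, the context sides of all pairs in a batch form the memory content for that step.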
This part aims to obtain the inter-sentence attention matrix, which can also be regarded as the core memory part of the CAMN. The input sentence and the context sentences in the memory first go through a multi-head attention layer that encodes contextualized information into each word representation:
where the memory-side input is the RNN output of the context sentence, and the hidden state at time $t$ is

$$h_t = \phi(h_{t-1}, w_t),$$

where $\phi$ is an activation function and $w_t$ is the $t$-th word in the context sentence.
The lists of new word representations are denoted as follows:
Each word representation is a vector of dimension $d_h$, where $d_h$ is the size of the hidden state in the MultiHead function.
Then, for each context sentence representation, we apply multi-head attention, treating the input sentence representation as the query sequence, and obtain the attention matrix:
Every element can be regarded as an indicator of the similarity between the $i$-th word in the input sentence representation and the $j$-th word in the memory sentence representation.
Finally, we perform a softmax operation on every column of the matrix to normalize the values so that they can be considered as probabilities from the input sentence representation to the memory sentence representation:
We treat the probability vector as a set of weights to sum all the representations in the memory sentence and obtain the memory-sentence-specific argument embedding:
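The attention-matrix-then-weighted-sum procedure above can be sketched as follows. This is a simplified single-head version under assumed dot-product scoring; the paper uses multi-head attention, and the variable names are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def inter_sentence_attention(S, C):
    """S: (n, d) input-sentence word representations;
    C: (l, d) one context sentence's word representations.
    Builds the similarity matrix, normalizes it over the context
    words, and returns for each source word a weighted sum of
    context-word representations, shape (n, d)."""
    E = S @ C.T             # (n, l) word-to-word similarity matrix
    P = softmax(E, axis=1)  # normalize over the context words
    return P @ C            # (n, d) context embedding per source word
```

Each row of the normalized matrix sums to one, so the output for every source word is a convex combination of the context-word vectors.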
Because the context sentences are different, the overall contributions of these word representations should be different as well. We let the model itself learn how to make use of these contextual word representations.
Following the attention combination mechanism of Libovický and Helcl (2017), we use a weighted-average strategy to combine the attention representations from different memory sources. We calculate the mean value of every raw similarity matrix to indicate the similarity between the input sentence and each context sentence, and use the softmax function to normalize these means into a probability vector indicating the similarity of the input sentence to all the associated sentences:
where mean(·) represents the mean function. Then, we use the probability vector as weights to sum all the contextual attention embeddings into the final contextual attention embedding of the $i$-th word in the input sentence:
We denote the $i$-th source attention embedding and the $i$-th contextual attention embedding after the feed-forward operation as $s_i$ and $c_i$, respectively. Then we use a context gate Tu et al. (2017) to integrate the source and context attentions and control the flow from the source side and the context side. The gate is calculated by
Their gated sum is

$$h_i = \lambda_i \odot s_i + (1 - \lambda_i) \odot c_i,$$

where $s_i$ and $c_i$ are the source and contextual attention embeddings, $\sigma$ in the gate computation is the logistic sigmoid function, $\odot$ is the point-wise multiplication, and the gate parameters are trained with the model. As illustrated in Figure 2, the output of the gate is integrated into the encoder-decoder attention part at each decoding step.
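A minimal sketch of such a context gate, assuming a simple linear parameterization of the gate in the spirit of Tu et al. (2017) (the weight names `W_s`, `W_c`, `b` are illustrative assumptions, not the paper's exact parameterization):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def context_gate(s, c, W_s, W_c, b):
    """Element-wise gated sum of source attention s and context
    attention c (both (d,)):
        lam = sigmoid(W_s s + W_c c + b)
        h   = lam * s + (1 - lam) * c."""
    lam = sigmoid(W_s @ s + W_c @ c + b)  # (d,) gate in (0, 1)
    return lam * s + (1.0 - lam) * c      # convex combination per dim
```

Each output element lies between the corresponding source and context values, so the gate smoothly trades off the two information streams dimension by dimension.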
Multiple Context Attention
For multiple context sentences in the memory ($m > 1$), we have two ways to integrate the memory information. One is concatenated multiple context attention, which concatenates the multiple context sentences into one context sequence with the break symbol ‘####’ (Wang et al. 2017) to identify sentence boundaries.
The other is parallel multiple context attention, which uses a softmax function to calculate the weighted sum of the attentions between each context sentence and the current sentence, as shown in Eq. (13). (In this paper, due to limited computational resources, we experiment only with the concatenated multiple context attention; we leave the parallel variant for future work.)
5 Experimental Setup
| Model | TED Talks (Zh-En) | TED Talks (Es-En) | OpenSubtitles (Es-En) | WMT (Es-En) |
|---|---|---|---|---|
| Context-aware Transformer Voita et al. (2018) | 18.24 | 38.74 | 40.19 | 23.76 |
| Transformer with HAN Miculicich et al. (2018) | 17.79 | 37.24 | 36.23 | 22.76 |
| Contextual Associated Memory Network | 18.65 | 39.19 | 40.70 | 24.38 |
| - w/o RNN Context | 18.44 | 38.46 | 40.10 | 22.87 |
| - w/o Inter-sentence Attention | 18.36 | 38.74 | 40.38 | 23.96 |
| - w/o RNN Context & Inter-sentence Attention | 17.92 | 38.46 | 39.96 | 22.09 |
The proposed document-level NMT model is evaluated on two language pairs, i.e., Chinese-to-English (Zh-En) and Spanish-to-English (Es-En), in three domains: talks, subtitles, and news.
The Zh-En TED talk documents are part of the IWSLT 2015 Evaluation Campaign machine translation task (https://wit3.fbk.eu/mt.php?release=2015-01). We use dev2010 as the development set and combine tst2010-2013 as the test set. The Es-En corpus is also a subset of IWSLT 2014. We use dev2010 as the development set and test2010-2012 as the test set.
The Es-En News-Commentary11 corpus (http://opus.nlpl.eu/News-Commentary11.php) has document-level delimitation. We evaluate on the WMT sets Bojar et al. (2013): newstest2008 for development, and newstest2009-2013 for testing.
Table 1 lists the statistics of all the concerned datasets.
| Random context sentence(s) | 18.38 | 18.37 | 18.11 | 17.42 | 15.88 |
5.2 Data preprocessing
The English and Spanish datasets are tokenized by tokenizer.perl and truecased by truecase.perl, provided by MOSES (https://github.com/moses-smt/mosesdecoder), a statistical machine translation system proposed by Koehn et al. (2007). The Chinese corpus is tokenized by the Jieba Chinese text segmentation tool (https://github.com/fxsjy/jieba). Words in sentences are segmented into subwords by Byte-Pair Encoding (BPE) Sennrich et al. (2016) with 32k BPE operations.
5.3 Model Configuration
Example of the translation result. The context sentences are the three sentences preceding the source sentence, and words in deeper blue in the context indicate more heuristic clues for better translation. Salient contextual words are provided with English translations.
6.1 Translation Performance
Table 2 reports the BLEU scores for different models on multiple corpora. The baselines are a re-implemented attention-based NMT system, RNNSearch* Bahdanau et al. (2015), and the Transformer Vaswani et al. (2017), both built with the THUMT toolkit Zhang et al. (2017). We also evaluate the Context-aware model proposed by Voita et al. (2018) on these datasets.
The results in Table 2 demonstrate that our proposed model outperforms all the compared models; in particular, it is significantly better than the baseline Transformer at significance level $p < 0.05$. Our proposed model outperforms the RNNSearch* baseline by 2.56 BLEU points on the TED Talks (Zh-En) dataset, 2.64 BLEU points on the TED Talks (Es-En) dataset, 0.80 BLEU points on the OpenSubtitles (Es-En) dataset, and 1.44 BLEU points on the WMT (Es-En) dataset.
Furthermore, our proposed model achieves gains of 0.89, 0.16, 0.74, and 0.44 BLEU points on these four datasets, respectively, over the Transformer baseline. Compared with the Context-aware Transformer proposed by Voita et al. (2018), our approach yields an average gain of 0.49 BLEU across these datasets. Moreover, the average BLEU improvement over the Transformer with HAN proposed by Miculicich et al. (2018) (results reported by its authors) is 2.23 points.
6.2 Ablation Experiments
We investigate the impact of different components of our model by removing one or more of them.
If we do not employ the RNN operation in the context encoder, the multi-head attention works directly on the context embedding.
If the model is trained without the inter-sentence attention module of the CAMN, we select the context sentence randomly from the training set, and the context attention is generated from the hidden states of the context embedding after the RNN.
If we remove both the RNN operation and the inter-sentence attention, the context attention is produced from the word embeddings of a randomly selected context sentence, and the context encoder, with its stack of multi-head attention and feed-forward layers, is the same as the source encoder.
As shown in Table 3, all of the components greatly contribute to the performance of our proposed model. If we remove any step in Context Encoder, the performance drops dramatically. Such results indicate that all features introduced by our CAMN enhanced model play an important and complementary role in our model.
6.3 Effect of Contextual Information
Different definitions of the context sentence The context sentence in our work is the previous sentence of the current sentence. We investigate the effect of different context sentence definitions on the TED Talks (Zh-En) dataset. Like Voita et al. (2018), we run the context encoder on the previous sentence, the next sentence, and a randomly selected context sentence from the document.
Different context sizes We also compare the effect of different context sizes on the TED Talks (Zh-En) dataset.
As shown in Table 4, the model using the previous sentence as the context gets the best performance on the TED Talks (Zh-En) dataset. Moreover, adding more contextual information through concatenated multiple context attention does not appear beneficial: the BLEU score does not improve with longer context.
6.4 Translation Quality
Table 5 shows an example from the TED Talks (Zh-En) dataset (extracted from line 4,123), on which the translation of our model is compared with those of other methods. The translation of the HAN model is downloaded from Miculicich et al. (2018)'s GitHub (https://github.com/idiap/HAN_NMT/tree/master/test_out). This example shows that our proposed model is capable of recognizing the tense and even the discourse relation from the document-level context.
7 Conclusion and Future Work
We propose a memory network enhancement over Transformer based NMT which provides a natural solution for the requirement of modeling document-level context. Experiments show that our model performs better on the datasets of multiple domains and language pairs and has the ability to capture salient document-level contextual clues and select the most relevant part related to the input sequence from the memory.
In future work, we will consider introducing discourse information to enhance our model. However, discourse information can bring considerable noise, and its internal structure may be particularly complex, so it is necessary to abstract its key features effectively. Such discourse information could provide heuristic features that improve performance during training and decoding.
- Aha (2013) David W Aha. 2013. Lazy learning. Springer Science & Business Media.
- Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations, pages 1–15, San Diego, USA.
- Bojar et al. (2013) Ondřej Bojar, Christian Buck, Chris Callison-Burch, Christian Federmann, Barry Haddow, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2013. Findings of the 2013 Workshop on Statistical Machine Translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 1–44, Sofia, Bulgaria. Association for Computational Linguistics.
- Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using rnn encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar. Association for Computational Linguistics.
- Collins et al. (2005) Michael Collins, Philipp Koehn, and Ivona Kucerova. 2005. Clause restructuring for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 531–540, Ann Arbor, Michigan. Association for Computational Linguistics.
- Daelemans (1999) Walter Daelemans. 1999. Introduction to the special issue on memory-based language processing. Journal of Experimental & Theoretical Artificial Intelligence, 11(3):287–296.
- Fix and Hodges Jr (1951) Evelyn Fix and Joseph L Hodges Jr. 1951. Discriminatory analysis-nonparametric discrimination: consistency properties. Technical report, California Univ Berkeley.
- Guan et al. (2019) Chaoyu Guan, Yuhao Cheng, and Hai Zhao. 2019. Semantic role labeling with associated memory network. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3361–3371, Minneapolis, Minnesota. Association for Computational Linguistics.
- Hinton et al. (2012) Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
- Jean et al. (2017) Sebastien Jean, Stanislas Lauly, Orhan Firat, and Kyunghyun Cho. 2017. Does Neural Machine Translation Benefit from Larger Context? arXiv e-prints, page arXiv:1704.05135.
- Kalchbrenner and Blunsom (2013) Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1700–1709, Seattle, Washington, USA. Association for Computational Linguistics.
- Koehn et al. (2007) Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic. Association for Computational Linguistics.
- Kuang and Xiong (2018) Shaohui Kuang and Deyi Xiong. 2018. Fusing recency into neural machine translation with an inter-sentence gate model. In Proceedings of the 27th International Conference on Computational Linguistics, pages 607–617, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
- Kuang et al. (2018) Shaohui Kuang, Deyi Xiong, Weihua Luo, and Guodong Zhou. 2018. Modeling coherence for neural machine translation with dynamic and topic caches. In Proceedings of the 27th International Conference on Computational Linguistics, pages 596–606, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
- Lebowitz (1983) Michael Lebowitz. 1983. Memory-based parsing. Artificial Intelligence, 21(4):363–404.
- Libovický and Helcl (2017) Jindřich Libovický and Jindřich Helcl. 2017. Attention strategies for multi-source sequence-to-sequence learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 196–202, Vancouver, Canada. Association for Computational Linguistics.
- Lison and Tiedemann (2016) Pierre Lison and Jörg Tiedemann. 2016. Opensubtitles2016: Extracting large parallel corpora from movie and tv subtitles. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia. European Language Resources Association (ELRA).
- Maruf and Haffari (2018) Sameen Maruf and Gholamreza Haffari. 2018. Document context neural machine translation with memory networks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1275–1284, Melbourne, Australia. Association for Computational Linguistics.
- Michel and Neubig (2018) Paul Michel and Graham Neubig. 2018. Extreme adaptation for personalized neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 312–318, Melbourne, Australia. Association for Computational Linguistics.
- Miculicich et al. (2018) Lesly Miculicich, Dhananjay Ram, Nikolaos Pappas, and James Henderson. 2018. Document-level neural machine translation with hierarchical attention networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2947–2954, Brussels, Belgium. Association for Computational Linguistics.
- Nivre et al. (2004) Joakim Nivre, Johan Hall, and Jens Nilsson. 2004. Memory-based dependency parsing. In Proceedings of the Eighth Conference on Computational Natural Language Learning (CoNLL-2004) at HLT-NAACL 2004.
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.
- Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.
- Skousen (1989) Royal Skousen. 1989. Analogical modeling of language. Springer Science & Business Media.
- Skousen (2013) Royal Skousen. 2013. Analogy and structure. Springer Science & Business Media.
- Sukhbaatar et al. (2015) Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. 2015. End-to-end memory networks. In Advances in neural information processing systems, pages 2440–2448.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran Associates, Inc.
- Tiedemann and Scherrer (2017) Jörg Tiedemann and Yves Scherrer. 2017. Neural machine translation with extended context. In Proceedings of the Third Workshop on Discourse in Machine Translation, pages 82–92, Copenhagen, Denmark. Association for Computational Linguistics.
- Tu et al. (2017) Zhaopeng Tu, Yang Liu, Zhengdong Lu, Xiaohua Liu, and Hang Li. 2017. Context gates for neural machine translation. Transactions of the Association for Computational Linguistics, 5:87–99.
- Tu et al. (2018) Zhaopeng Tu, Yang Liu, Shuming Shi, and Tong Zhang. 2018. Learning to remember translation history with a continuous cache. Transactions of the Association for Computational Linguistics, 6:407–420.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
- Voita et al. (2018) Elena Voita, Pavel Serdyukov, Rico Sennrich, and Ivan Titov. 2018. Context-aware neural machine translation learns anaphora resolution. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1264–1274, Melbourne, Australia. Association for Computational Linguistics.
- Wang et al. (2017) Longyue Wang, Zhaopeng Tu, Andy Way, and Qun Liu. 2017. Exploiting cross-sentence context for neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2826–2831, Copenhagen, Denmark. Association for Computational Linguistics.
- Weston et al. (2015) Jason Weston, Sumit Chopra, and Antoine Bordes. 2015. Memory networks. arXiv preprint arXiv:1410.3916.
- Zhang et al. (2017) Jiacheng Zhang, Yanzhuo Ding, Shiqi Shen, Yong Cheng, Maosong Sun, Huan-Bo Luan, and Yang Liu. 2017. THUMT: an open source toolkit for neural machine translation. CoRR, abs/1706.06415.
- Zhang et al. (2018) Jiacheng Zhang, Huanbo Luan, Maosong Sun, Feifei Zhai, Jingfang Xu, Min Zhang, and Yang Liu. 2018. Improving the transformer translation model with document-level context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 533–542, Brussels, Belgium. Association for Computational Linguistics.