
Document-level Neural Machine Translation with Inter-Sentence Attention

by   Shu Jiang, et al.

Standard neural machine translation (NMT) is built on the assumption that sentences are translated independently of document-level context. Most existing document-level NMT methods only briefly introduce document-level information but fail to select the most relevant parts of the document context. The capacity of a memory network to detect the parts of the memory most relevant to the current sentence provides a natural solution to the requirement of modeling document-level context in document-level NMT. In this work, we propose a Transformer NMT system with an associated memory network (AMN) that both captures the document-level context and selects from the memory the most salient part related to the concerned translation. Experiments on several tasks show that the proposed method significantly improves NMT performance over strong Transformer baselines and other related studies.





1 Introduction

Neural Machine Translation (NMT) Kalchbrenner and Blunsom (2013); Sutskever et al. (2014); Cho et al. (2014); Bahdanau et al. (2015); Vaswani et al. (2017), established on the encoder-decoder framework, in which the encoder takes a source sentence as input and encodes it into a fixed-length embedding vector and the decoder generates the translation according to this encoding, has achieved advanced translation performance in recent years. So far, most models take a standard assumption and translate every sentence independently, ignoring document-level contextual clues during translation. However, document-level information can improve translation performance in multiple aspects: consistency, disambiguation, and coherence

Kuang et al. (2018). If every sentence is translated independently of document-level context, it is difficult to keep the translations of sentences across the entire text consistent with each other. Moreover, document-level context can also help the model disambiguate words with multiple senses. Finally, the global context helps translate in a coherent way.

There have been a few recent attempts to introduce document-level information into existing standard NMT models. Jean et al. (2017) model the context from the surrounding text in addition to the source sentence, and Tiedemann and Scherrer (2017) extend the source sentence and translation units with contextual segments to improve the translation. Wang et al. (2017) use a hierarchical Recurrent Neural Network (RNN) to import the information of previous sentences.

Miculicich et al. (2018) propose a multi-head hierarchical attention machine translation model to capture word-level and sentence-level information. The cache-based model proposed by Kuang et al. (2018) uses a dynamic cache and a topic cache to capture the connections from neighboring sentences. In addition, Wang et al. (2017), Kuang and Xiong (2018), and Voita et al. (2018) all add contextual information to the NMT model by applying the gating mechanism proposed by Tu et al. (2017) to dynamically control the auxiliary global context information at each decoding step. However, most of the existing document-level NMT methods have to inconveniently prepare the contextual input or model the global context in advance.

Inspired by the observation that both humans and document-level machine translation models always refer to the context of the source sentence during translation, like a query in their memory, we propose to utilize the document-level sentences associated with the source sentence to help predict the target sentence. To reach this goal, we adopt a Memory Network component Weston et al. (2015); Sukhbaatar et al. (2015); Guan et al. (2019), which provides a natural solution to the requirement of modeling document-level context in document-level NMT. In fact, Maruf and Haffari (2017) have already presented a document-level NMT model which projects the document contexts into a tiny dense hidden state space for an RNN model using memory networks and updates it word by word; their model is effective in exploiting both source and target document context.

Differing from any previous work, this paper presents a Transformer NMT model with a document-level Memory Network enhancement Weston et al. (2015); Sukhbaatar et al. (2015), which incorporates contextual clues into the encoding of the source sentence through the Memory Network. Unlike the work of Maruf and Haffari (2017), which memorizes the whole document information in a tiny dense hidden state, the memory in our work computes the document-level contextualized information associated with the current source sentence using the attention mechanism. In this way, our proposed model is able to focus on the part of the memory most relevant to the concerned translation, which exactly encodes the concerned document-level context.

The empirical results indicate that our proposed method significantly improves the BLEU score compared with a strong Transformer baseline and performs better than other related models for document-level machine translation on multiple language pairs with multiple domains.

Figure 1: (a) Transformer architecture. (b) Multi-Head attention.

2 Related Work

The existing work about NMT on document-level can be divided into two parts: one is how to obtain the document-level information in NMT, and the other is how to integrate the document-level information.

2.1 Mining Document-level Information


Tiedemann and Scherrer (2017) propose to simply extend the context during NMT model training in different ways: (1) extending the source sentence to include context from the previous sentences in the source language, and (2) extending translation units to increase the segments to be translated.

Document RNN

Wang et al. (2017) propose a cross-sentence context-aware RNN approach to produce a global context representation called Document RNN. Given a source sentence in the document to be translated and its previous sentences, they can obtain all sentence-level representations after processing each sentence. The last hidden state represents the summary of the whole sentence as it stores order-sensitive information. Then the summary of the global context is represented by the last hidden state over the sequence of the above sentence-level representations.
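The two-level summary described above can be sketched in a few lines of NumPy. This is an illustrative toy, not the authors' implementation: a plain tanh recurrence stands in for whatever RNN cell Wang et al. (2017) use, and all sizes and weights are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden size (illustrative)

def rnn_last_state(inputs, W, U):
    """Run a simple tanh RNN over a sequence and return the last hidden
    state, which summarizes the whole sequence (order-sensitive)."""
    h = np.zeros(d)
    for x in inputs:
        h = np.tanh(W @ h + U @ x)
    return h

W, U = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1

# Sentence level: each previous sentence (a matrix of word vectors)
# is reduced to one vector, its last hidden state.
prev_sentences = [rng.normal(size=(5, d)), rng.normal(size=(7, d))]
sent_reps = [rnn_last_state(s, W, U) for s in prev_sentences]

# Document level: run a second RNN over the sentence vectors; its last
# hidden state is the Document RNN summary of the global context.
global_context = rnn_last_state(np.stack(sent_reps), W, U)
```

In the real model the two RNNs have separate, learned parameters; sharing W and U here is purely for brevity.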

Specific Vocabulary Bias

Michel and Neubig (2018) propose a simple yet parameter-efficient adaption method that only requires adapting the bias of output softmax to each particular use of the NMT system and allows the model to better reflect personal linguistic variations through translation.

2.2 Integrating Document-level Information

Adding Auxiliary Context

Wang et al. (2017) add the representation of cross-sentence context directly into the equation of the probability of the next word, and jointly update the decoding state with the previously predicted word and the source-side context vector.

Gating Auxiliary Context

Tu et al. (2017) introduce a context gate to automatically control the ratios of source and context representations contributions to the generation of target words. Wang et al. (2017) also introduce this mechanism in their work to dynamically control the information flowing from the global text at each decoding step.

Inter-sentence Gate Model

Kuang and Xiong (2018) propose an inter-sentence gate model, which is based on the attention-based NMT and uses the same encoder to encode two adjacent sentences and controls the amount of information flowing from the preceding sentence to the translation of the current sentence with an inter-sentence gate. This gate framework assigns element-wise weights to the input signals which are calculated by the context vectors of two adjacent sentences, target word representation and the decoder hidden state.

Cache-based Neural Model

Tu et al. (2018) propose to augment NMT models with an external cache to exploit translation history. At each decoding step, the probability distribution over generated words is updated online depending on the translation history retrieved from the cache with a query of the current attention vector, which helps NMT models dynamically adapt over time. The cache-based neural model proposed by

Kuang et al. (2018) consists of two components: topic cache and dynamic cache. When the decoder shifts to a new test document, the topic cache is emptied and filled with target topical words for the new test document. The dynamic cache is continuously expanded with newly generated target words from the best translation hypothesis of previous sentences. The final word prediction probability for the target word is calculated by a gate mechanism which combines the prediction probability from the cache-based neural model and the original NMT decoder.

Hierarchical Attention Networks

Miculicich et al. (2018) propose a Hierarchical Attention Networks (HAN) NMT model to capture the context in a structured and dynamic pattern. For each predicted word, it uses word-level and sentence-level abstractions and selectively focuses on different words and sentences.

Context-Aware Transformer

Voita et al. (2018) introduce the context information into the Transformer Vaswani et al. (2017) and leave the Transformer’s decoder intact while processing the context information on the encoder side. The model calculates the gate from the source sentence attention and the context sentence attention, then exploits their gated sum as the encoder output. Zhang et al. (2018) also extend the Transformer with a new context encoder to represent document-level context while incorporating it into both the original encoder and decoder by multi-head attention.

Figure 2: The framework of our model.

3 Background

3.1 Neural Machine Translation

Given a source sentence x = (x_1, …, x_n) in the document to be translated and a target sentence y = (y_1, …, y_m), the NMT model computes the probability of translation from the source sentence to the target sentence word by word:

P(y | x) = ∏_{j=1}^{m} P(y_j | y_{<j}, x),

where y_{<j} is a substring containing the words (y_1, …, y_{j-1}). Generally, with an RNN, the probability of generating the j-th word y_j is modeled as:

P(y_j | y_{<j}, x) = g(y_{j-1}, s_j, c_j),

where g is a nonlinear function that outputs the probability of y_j given the previously generated word y_{j-1}, the decoding hidden state s_j, and the j-th source representation c_j. The j-th decoding hidden state s_j is computed as

s_j = f(s_{j-1}, y_{j-1}, c_j).

For NMT models with an encoder-decoder framework, the encoder maps an input sequence of symbol representations (x_1, …, x_n) to a sequence of continuous representations z = (z_1, …, z_n). Then, the decoder generates the corresponding target sequence of symbols one element at a time.
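The word-by-word factorization above can be made concrete with a toy computation of a target sentence's log-probability. The vocabulary, the per-step distributions, and the word indices are all made up for illustration; a real model would produce the distributions from its decoder.

```python
import math

def sentence_log_prob(step_distributions, target_ids):
    """Log-probability of a target sentence under an autoregressive model.

    step_distributions[j] is the model's distribution over the vocabulary
    at decoding step j (already conditioned on y_<j and the source x);
    target_ids[j] is the index of the reference word y_j.
    """
    return sum(math.log(dist[y]) for dist, y in zip(step_distributions, target_ids))

# Toy 3-word vocabulary, 2 decoding steps.
dists = [
    [0.7, 0.2, 0.1],  # P(y_1 | x)
    [0.1, 0.8, 0.1],  # P(y_2 | y_1, x)
]
log_p = sentence_log_prob(dists, [0, 1])  # log(0.7) + log(0.8)
```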

3.2 Transformer Architecture

Based solely on the attention mechanism, Vaswani et al. (2017) propose a network architecture called the Transformer for NMT, which uses stacked self-attention and point-wise, fully connected layers for both the encoder and the decoder.

As illustrated in Figure 1 (a), the encoder is composed of a stack of N (usually N = 6) identical layers, and each layer has two sub-layers: (1) a multi-head self-attention mechanism, and (2) a simple, position-wise fully connected feed-forward network.

Multi-head attention, demonstrated in Figure 1 (b), allows the Transformer to jointly process information from different representation spaces at different positions. It linearly projects the queries Q, keys K, and values V h times with different, learned linear projections to d_k, d_k, and d_v dimensions respectively; the attention function is then performed in parallel, generating d_v-dimensional output values, which are concatenated and once again projected to yield the final result. The core of multi-head attention is Scaled Dot-Product Attention, calculated as:

Attention(Q, K, V) = softmax(Q K^T / √d_k) V.
The second sub-layer is a feed-forward network, which contains two linear transformations with a ReLU activation in between.
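The scaled dot-product attention at the core of the formula above can be sketched in a few lines of NumPy; the matrix sizes below are illustrative, and the learned per-head projections of the full multi-head mechanism are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # similarity of each query to each key
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 16))   # 4 query positions, d_k = 16
K = rng.normal(size=(6, 16))   # 6 key positions
V = rng.normal(size=(6, 16))   # one value per key position
out = scaled_dot_product_attention(Q, K, V)  # one d_v-vector per query
```

Each row of the softmaxed score matrix sums to 1, so every output row is a convex combination of the value vectors.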

Similar to the encoder, the decoder is also composed of a stack of

identical layers but it inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. The Transformer also employs residual connections around each of the sub-layers, followed by layer normalization. Thus, the Transformer is more parallelizable and faster for translating than earlier RNN methods.

3.3 Memory Network

Memory networks Weston et al. (2015) utilize external memories as inference components based on long-range dependencies, and can be categorized as a sort of lazy machine learning Aha (2013). Using a similar memorizing mechanism, memory-based learning methods have also been applied in multiple traditional models Daelemans (1999); Fix and Hodges Jr (1951); Skousen (1989, 2013); Lebowitz (1983); Nivre et al. (2004). A memory network introduced by Weston et al. (2015) is a set of vectors M = {m_1, …, m_n}, where each memory cell m_i is potentially relevant to a discrete object (for example, a word). The memory is equipped with a read and optionally a write operation. Given a query vector q, the output vector produced by reading from the memory is o = Σ_i p(q, m_i) m_i, where p(q, m_i) scores the match between the query vector q and the i-th memory cell m_i.

4 Model

4.1 Framework

Our NMT model consists of two components: a Contextual Associated Memory Network (CAMN) and a Transformer model. For the CAMN component, the core part is a neural controller, which acts as a “processor” that reads memory from the contextual storage “RAM” according to the input before sending this memory to other components. The controller calculates the correlation between the input and the memory data, i.e., performs “memory addressing”.

4.2 Encoders

Our model requires two encoders: the context encoder for the CAMN and the source encoder that encodes the input sentence for translation.

Inspired by Wang et al. (2017), who introduce the Document RNN to summarize cross-sentence context information, we use an RNN over the context sentence to generate the context representation; the hidden state at each time step represents the relation from the first word to the current word. The source encoder is composed of a stack of N layers, the same as the source encoder in the original Transformer Vaswani et al. (2017).

4.3 Contextual Associated Memory Network

The proposed contextual associated memory network consists of three parts, context selection, inter-sentence attention, and context gating.

Context Selection

We aim to utilize the context sentences and their representations to assist our model in predicting the target sentences. For the sake of fairness, we could treat all sentences in the document as our source. However, it is impossible to attend to all the sentences in the training dataset because of the extremely high computing and memory cost. Following Voita et al. (2018), whose model achieves the best performance when using a context encoder for the previous sentence, we use the previous sentence of the source sentence as the context sentence. Then, at each training step, we compose the context sentences of all the source sentences in a batch of size b as the memory context sentences.

Inter-Sentence Attention

This part aims to obtain the inter-sentence attention matrix, which can also be regarded as the core memory part of the CAMN. The input sentence and the context sentences in the memory first go through a multi-head attention layer to encode contextualized information into each word representation. For the k-th context sentence c_k, the input to this attention layer is the RNN output over the sentence, whose hidden state at time t is

h_t = f(h_{t-1}, e(w_t)),

where f is an activation function, e(·) is the word embedding, and w_t is the t-th word in the context sentence c_k.

The lists of new word representations are denoted as S = (s_1, …, s_n) for the input sentence and M_k = (m_1, …, m_{L_k}) for the k-th context sentence. Each word representation is a vector of dimension d, where d is the size of the hidden state in the MultiHead function.

Then, for each context sentence representation M_k, we apply multi-head attention, treating the input sentence representation S as the query sequence, and obtain the raw attention matrix E_k. Every element E_k[i, j] can be regarded as an indicator of the similarity between the i-th word of the input sentence representation S and the j-th word of the memory sentence representation M_k.

Finally, we perform a softmax operation on every column of E_k to normalize the values so that they can be considered as probabilities from the input sentence representation S to the memory sentence representation M_k:

P_k = softmax(E_k).

We treat the probability vector P_k[i, ·] as a set of weights to sum all the representations in M_k and get the memory-sentence-specified argument embedding a_i^k:

a_i^k = Σ_j P_k[i, j] m_j.

Because the context sentences are different, the overall contributions of their word representations should be different as well. We let the model itself learn how to make use of these contextual word representations.

Following the attention combination mechanism of Libovický and Helcl (2017), we use a weighted average strategy to combine these attention representations from different memory sources. We calculate the mean value of every raw similarity matrix E_k to indicate the similarity between the input sentence and the context sentence c_k, and we use the softmax function to normalize them into a probability vector β indicating the similarity of the input sentence towards all the associated sentences:

β = softmax(mean(E_1), …, mean(E_m)),

where mean(·) denotes the mean function. Then, we use the probability vector β as weights to sum all the contextual attention embeddings into the final contextual attention embedding c_i of the i-th word in the input sentence:

c_i = Σ_k β_k a_i^k.
Context Gating

We annotate the i-th source attention embedding and the i-th contextual attention embedding after the feed-forward operation as s_i and c_i. Then we use a context gate Tu et al. (2017) to integrate the source and context attentions and control the flow from the source side and the context side. The gate is calculated by

λ_i = σ(W_λ [s_i; c_i]),

and their gated sum is

h_i = λ_i ⊙ s_i + (1 − λ_i) ⊙ c_i,

where σ is the logistic sigmoid function, ⊙ is point-wise multiplication, and W_λ is trained by the model. As illustrated in Figure 2, the output of the gate is integrated into the encoder-decoder attention part at each decoding step.
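A minimal sketch of this gated combination, assuming the gate form above; in the real model W_λ is learned, so the random matrix here is only a stand-in.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def context_gate(s, c, W):
    """Gated sum of the source embedding s and the context embedding c:
    lam = sigmoid(W [s; c]),  h = lam * s + (1 - lam) * c (element-wise)."""
    lam = sigmoid(W @ np.concatenate([s, c]))
    return lam * s + (1.0 - lam) * c, lam

rng = np.random.default_rng(0)
d = 8
W = rng.normal(size=(d, 2 * d)) * 0.1   # stand-in for the learned gate matrix
s, c = rng.normal(size=d), rng.normal(size=d)
h, lam = context_gate(s, c, W)
```

Because every gate value lies strictly between 0 and 1, each output coordinate interpolates between the source and context signals rather than replacing one with the other.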

Multiple Context Attention

For multiple context sentences in the memory (m > 1), we have two ways to integrate the memory information. One way is concatenated multiple context attention, which concatenates the multiple context sentences into one context sequence with the break symbol ‘####’ (Wang et al. 2017) to identify the sentence boundaries.

The other way is parallel multiple context attention, which calculates the weighted sum of the attentions between each context sentence and the current sentence by the softmax function, as shown in Eq. (13). (In this paper, due to limited computational resources, we experiment only with concatenated multiple context attention; we leave the experiments on parallel multiple context attention for future work.)

5 Experimental Setup

TED Talks Subtitles News
Zh-En Es-En Es-En Es-En
Training 209,941 180,853 48,301,352 238,872
Tuning 887 887 1,000 2,000
Test 5,473 4,706 1,000 14,522
Table 1: Data statistics of sentences.
TED Talks Subtitles News
Models Zh-En Es-En Es-En Es-En
RNNSearch* 16.09 36.55 39.90 22.94
Transformer 17.76 39.03 39.96 23.71
Context-aware Transformer Voita et al. (2018) 18.24 38.74 40.19 23.76
Transformer with HAN Miculicich et al. (2018) 17.79 37.24 36.23 22.76
Our model 18.65 39.19 40.70 24.38
Table 2: BLEU scores on the different datasets. The mark “†” after a score indicates that the proposed method is significantly better than the baseline Transformer at significance level p < 0.05 Collins et al. (2005). Scores in bold indicate the best results on the same dataset.
TED Talks Subtitles News
Models Zh-En Es-En Es-En Es-En
Contextual Associated Memory Network 18.65 39.19 40.70 24.38
- w/o RNN Context 18.44 38.46 40.10 22.87
- w/o Inter-sentence Attention 18.36 38.74 40.38 23.96
- w/o RNN Context & Inter-sentence Attention 17.92 38.46 39.96 22.09
Table 3: Ablation study on these datasets.

5.1 Data

The proposed document-level NMT model is evaluated on two language pairs, i.e., Chinese-to-English (Zh-En) and Spanish-to-English (Es-En), in three domains: talks, subtitles, and news.

TED Talks

Zh-En TED talk documents are part of the IWSLT2015 Evaluation Campaign Machine Translation task. We use dev2010 as the development set and combine tst2010-2013 as the test set. The Es-En corpus is a subset of IWSLT2014. We use dev2010 as the development set and test2010-2012 as the test set.


Subtitles

The Es-En corpus is a subset of OpenSubtitles2018 Lison and Tiedemann (2016). We randomly select 1,000 continuous sentences for each of the development set and the test set.


News

The Es-En News-Commentary v11 corpus has document-level delimitation. We evaluate on the WMT sets Bojar et al. (2013): newstest2008 for development, and newstest2009-2013 for testing.

Table 1 lists the statistics of all the concerned datasets.

Context selection m=1 m=2 m=3 m=4 m=5
Previous sentence(s) 18.65 18.46 18.14 18.03 15.53
Next sentence(s) 18.57 18.45 17.69 17.43 15.14
Random context sentence(s) 18.38 18.37 18.11 17.42 15.88
Table 4: Results on TED Talks (Zh-En) dataset with different context sentence size and context selection

5.2 Data preprocessing

The English and Spanish datasets are tokenized by tokenizer.perl and truecased by truecase.perl, provided by MOSES, a statistical machine translation system by Koehn et al. (2007). The Chinese corpus is tokenized by the Jieba Chinese text segmentation tool. Words in sentences are segmented into subwords by Byte-Pair Encoding (BPE) Sennrich et al. (2016) with 32k BPE operations.
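One BPE merge step can be illustrated with a toy corpus. This is a sketch of the merge procedure from Sennrich et al. (2016), not the subword-nmt implementation; the word frequencies and the '</w>' end-of-word marker follow the paper's classic example.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over a frequency-weighted corpus and
    return the most frequent one (the next pair BPE would merge)."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1]); i += 2
            else:
                out.append(symbols[i]); i += 1
        merged[tuple(out)] = freq
    return merged

# Character-split corpus with frequencies; '</w>' marks word ends.
corpus = {('l', 'o', 'w', '</w>'): 5,
          ('l', 'o', 'w', 'e', 'r', '</w>'): 2,
          ('n', 'e', 'w', 'e', 's', 't', '</w>'): 6}
pair = most_frequent_pair(corpus)   # ('w', 'e'), seen 2 + 6 = 8 times
corpus = merge_pair(corpus, pair)   # 'we' becomes a single subword symbol
```

Running this loop 32k times over the real training corpus yields the 32k merge operations used in our preprocessing.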

5.3 Model Configuration

We use the Transformer proposed by Vaswani et al. (2017) as our baseline and implement our work using THUMT Zhang et al. (2017), an open-source toolkit for NMT developed by the Natural Language Processing Group at Tsinghua University. We follow the configuration of the Transformer “base model” described in the original paper Vaswani et al. (2017). Both the encoder and the decoder consist of 6 hidden layers, and we choose the previous sentence as the context sentence in the memory network. All hidden states have 512 dimensions, multi-head attention uses 8 heads, and each training batch contains about 6,520 source tokens. Finally, we evaluate the performance of the model by BLEU score Papineni et al. (2002) using multi-bleu.perl on the tokenized text.

Table 5: Example of the translation result. The context sentences are the three previous sentences before the source sentence, and words in deeper blue in the context indicate more heuristic clues for better translation. Salient contextual words have been provided with English translations.

6 Results

6.1 Translation Performance

Table 2 shows the BLEU scores of the different models on multiple corpora. The baselines are a re-implemented attention-based NMT system, RNNSearch* Bahdanau et al. (2015), and the Transformer Vaswani et al. (2017), both using the THUMT toolkit. We also apply the Context-aware model proposed by Voita et al. (2018) to these datasets.

The results in Table 2 demonstrate that our proposed model significantly outperforms all the compared models; in particular, our model is significantly better than the baseline Transformer at significance level p < 0.05. Our proposed model outperforms the RNNSearch* baseline by 2.56 BLEU points on the TED Talks (Zh-En) dataset, 2.64 BLEU points on the TED Talks (Es-En) dataset, 0.80 BLEU points on the OpenSubtitles (Es-En) dataset, and 1.44 BLEU points on the WMT (Es-En) dataset.

Furthermore, our proposed model achieves gains of 0.89, 0.16, 0.74, and 0.67 BLEU points on these four datasets over the Transformer baseline. Compared with the Context-aware Transformer proposed by Voita et al. (2018), our approach yields an average improvement of 0.49 BLEU across these datasets. Moreover, the average improvement over the Transformer with HAN proposed by Miculicich et al. (2018) (results reported by its authors) is 2.23 points.

6.2 Ablation Experiments

We investigate the impact of different components of our model by removing one or more of them.

If we do not employ the RNN operation on the context encoder, the multi-head attention works directly on the context embedding.

If the model is trained without the inter-sentence attention module of the CAMN, we select the context sentence randomly from the training set, and the context attention is generated from the hidden states of the context embedding after the RNN.

If we remove both the RNN operation and the inter-sentence attention, the context attention is produced from the word embeddings of a randomly selected context sentence, and the context encoder, a stack of multi-head attention and feed-forward layers, is the same as the source encoder.

As shown in Table 3, all of the components contribute substantially to the performance of our proposed model. If we remove any step in the context encoder, the performance drops dramatically. These results indicate that all features introduced by our CAMN-enhanced model play an important and complementary role.

6.3 Effect of Contextual Information

Different definitions of the context sentence The context sentence in our work is the previous sentence of the current sentence. We investigate the effect of different context sentence definitions on the TED Talks (Zh-En) dataset. Following Voita et al. (2018), we use the context encoder for the previous sentence(s), the next sentence(s), and randomly selected context sentence(s) from the document.

Different context sizes We also compare the effect of different context sizes on the TED Talks (Zh-En) dataset.

As shown in Table 4, the model using the previous sentence as the context achieves the best performance on the TED Talks (Zh-En) dataset. Moreover, adding more contextual information via concatenated multiple context attention does not appear beneficial, and the BLEU score does not improve with longer context.

6.4 Translation Quality

Table 5 shows an example from the TED Talks (Zh-En) dataset (extracted from line 4,123), on which the translation of our model is compared to those of the other methods. The translation of the HAN model is downloaded from Miculicich et al. (2018)’s GitHub repository. This example shows that our proposed model is capable of recognizing the tense and even the discourse relation from document-level context.

7 Conclusion and Future Work

We propose a memory network enhancement over Transformer based NMT which provides a natural solution for the requirement of modeling document-level context. Experiments show that our model performs better on the datasets of multiple domains and language pairs and has the ability to capture salient document-level contextual clues and select the most relevant part related to the input sequence from the memory.

In future work, we consider introducing discourse information to enhance our model. However, discourse information can bring considerable noise, and its internal structure may be particularly complex, so it is necessary to effectively abstract its key features. Such discourse information could provide heuristic features that improve performance during both training and decoding.