Learning Efficient Lexically-Constrained Neural Machine Translation with External Memory

01/31/2019 ∙ by Ya Li, et al. ∙ Microsoft Anhui USTC iFLYTEK Co

Recent years have witnessed dramatic progress in neural machine translation (NMT); however, methods for manually guiding the translation procedure remain underexplored. Previous works handle this problem through lexically-constrained beam search in the decoding phase. Unfortunately, these lexically-constrained beam search methods suffer from two major disadvantages: high computational complexity, and a hard beam search that can generate unexpected translations. In this paper, we propose to learn the ability of lexically-constrained translation with an external memory, which overcomes the above disadvantages. For training, phrase pairs are automatically extracted using alignment and sentence parsing and then encoded into an external memory. This memory provides lexically-constrained information during training through a memory-attention mechanism. Experiments are conducted on the WMT Chinese to English and English to German tasks, and the results demonstrate the effectiveness of our method.




1 Introduction

Neural machine translation (NMT) has been one of the most popular natural language processing tasks in recent years; it aims to reduce the difficulty of communication between people from different countries. Recent works (Wu et al., 2016) have made this goal more realistic by providing human-comparable translation performance on certain domains. Among these translation techniques, the attention mechanism has proven to be key to improving translation quality (Bahdanau et al., 2014; Vaswani et al., 2017).

Although NMT has made great progress, effective and friendly interaction between human users and NMT systems remains a problem, and it is the main topic of this paper. For example, users may provide specific words or phrases that they prefer to see in the translation. Another real-world scenario is terminology whose translation differs completely across situations, such as the word "bank". Recent works (Hokamp and Liu, 2017; Post and Vilar, 2018; Anderson et al., 2016) regard these as lexically-constrained translation problems and propose to handle them with lexically-constrained beam search. Unfortunately, existing methods suffer from two major disadvantages. First, their computational complexity is either exponential (Anderson et al., 2016) or linear (Hokamp and Liu, 2017) in the number of lexical constraints. A more recent approach (Post and Vilar, 2018) achieves $O(1)$ complexity in the number of constraints; however, it needs a sufficiently large beam size when decoding with a large number of lexical constraints. Second, lexically-constrained beam search is a hard method, which may cause problems such as translations that contain the right constraints but leave the other parts inappropriate.

In this paper, we propose a novel method to overcome the above defects of existing methods. Instead of decoding with hard constrained beam search, our model is trained to learn the ability of lexically-constrained translation with predefined constraints stored in an external memory. The translation model is guided to attend to the corresponding contents of the external memory through a memory-attention mechanism. For accurate memory-attention, we propose a softmax loss that forces the model to attend to the right slots in the memory. Considering the lack of constraints in public training datasets, and to make our method easy to apply, we also introduce an algorithm for automatically extracting constraints. Compared with previous works, our method makes two main contributions:

  • The decoding of our model is as efficient as standard beam search, and it can handle a large number of lexical constraints with only a normal beam size.

  • Our lexically-constrained translation is more flexible: it can avoid unexpected translations even when improper constraints are provided, as well as outputs that contain the right constraints but leave the other parts unacceptable.

We organize the rest of the paper as follows. Section 2 briefly introduces the background of neural machine translation and lexically-constrained beam search. Section 3 presents the architecture for training our lexically-constrained model and explains the whole process in detail. Section 4 reports empirical experiments. Section 5 discusses related work and section 6 concludes the paper.

2 Lexically-Constrained Beam Search

Figure 1: The architecture of our memory-augmented transformer encoder. The constraint pairs are first encoded into an external memory as keys and values. The encoder then learns to integrate the corresponding contents of the memory with the source inputs through our memory-attention mechanism.

Recent NMT models are mainly based on encoder-decoder neural network architectures, such as the RNN-based translation model (Bahdanau et al., 2014) and the self-attention based transformer model (Vaswani et al., 2017). Both of them have dramatically accelerated the development of neural machine translation.

For notational simplicity, bold lowercase letters denote vectors, capital letters denote sentence sequences, and lowercase letters denote individual tokens in a sequence. Let $(X, Y)$ represent a sentence pair, where $X = (x_1, \dots, x_m)$ denotes the source sentence of length $m$ and $Y = (y_1, \dots, y_n)$ denotes the corresponding target sentence of length $n$. The encoder of the NMT model encodes the source sentence into vectors of context representations as follows:

$$\mathbf{h}_1, \dots, \mathbf{h}_m = f_{enc}(x_1, \dots, x_m)$$

where $\mathbf{h}_1, \dots, \mathbf{h}_m$ are the context representations of the source sentence and $f_{enc}$ is the encoder function. Given the context representations, the decoder learns to predict the next target word from the previously generated results as follows:

$$p(y_t \mid y_{<t}, X) = f_{dec}\big(y_{<t}, f_{att}(\mathbf{h}_1, \dots, \mathbf{h}_m)\big)$$

where $f_{dec}$ denotes the decoder function and $f_{att}$ represents the attention mechanism. With the above equations, the probability of generating the target sequence $Y$ can be formulated as follows:

$$p(Y \mid X) = \prod_{t=1}^{n} p(y_t \mid y_{<t}, X)$$

The training goal of NMT is to maximize the probability $p(Y \mid X)$ with respect to the functions $f_{enc}$, $f_{dec}$ and $f_{att}$.

In the inference phase, finding the best result over the whole search space is time consuming and requires a large amount of memory. If the vocabulary size is $|\mathcal{V}|$, the size of the search space is $|\mathcal{V}|^{n}$, where $n$ is the length of the target. Fortunately, beam search efficiently approximates the best result with a heuristic search that keeps the top $k$ hypotheses at every decoding step. Here $k$ is the beam size.
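The standard beam search described above can be sketched in a few lines of Python. This is a minimal illustration, not the decoder used in the paper: the `step_fn` callback, the toy probability table `TOY`, and the token names are all hypothetical, and a real NMT decoder would batch hypotheses and usually apply length normalization.

```python
import math
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[List[str], float]  # (tokens, accumulated log-probability)


def beam_search(step_fn: Callable[[List[str]], Dict[str, float]],
                bos: str, eos: str, beam_size: int, max_len: int) -> Hypothesis:
    """Keep the `beam_size` best partial translations at every decoding step."""
    beams: List[Hypothesis] = [([bos], 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        candidates: List[Hypothesis] = []
        for prefix, score in beams:
            # step_fn plays the role of the decoder: next-token probabilities.
            for token, prob in step_fn(prefix).items():
                if prob > 0.0:
                    candidates.append((prefix + [token], score + math.log(prob)))
        # Stable sort, so ties keep their original order.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == eos:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    pool = finished + beams or [([bos], 0.0)]
    return max(pool, key=lambda c: c[1])


# Hypothetical toy model in which the greedy first choice is not the best one.
TOY = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "x": 0.1},
}


def toy_step(prefix: List[str]) -> Dict[str, float]:
    # Any token without an entry in TOY is followed by end-of-sentence.
    return TOY.get(prefix[-1], {"</s>": 1.0})


greedy = beam_search(toy_step, "<s>", "</s>", beam_size=1, max_len=5)
wider = beam_search(toy_step, "<s>", "</s>", beam_size=2, max_len=5)
```

With a beam size of 1 the search commits to "a" and ends with probability 0.3, while a beam size of 2 keeps "b" alive and finds "b z" with probability 0.36. Lexically-constrained variants have to enlarge or partition this beam, which is where their extra cost comes from.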

Lexically-constrained beam search is a variant of standard beam search that generates results containing predefined constraints. However, this may sacrifice the efficiency and accuracy of decoding. For example, the decoding complexity of Grid Beam Search (GBS) (Hokamp and Liu, 2017) is linear in the number of constraints, with an effective beam size of $k \cdot N$ for beam size $k$ and $N$ constraints. Although dynamic beam allocation (DBA) (Post and Vilar, 2018) has a time complexity similar to that of standard beam search, it requires a large beam size that grows with the number of constraints. Additionally, lexically-constrained beam search forces the decoding results to contain the constraints, which dramatically limits the search space of beam search, especially with long phrase constraints. Consequently, the quality of the final generated results is degraded.

3 Lexically-Constrained Neural Machine Translation with External Memory

Figure 2: The pipeline of extracting phrase pairs.

In this section, we discuss the details of our lexically-constrained neural machine translation with external memory. Instead of forcing beam search to generate results containing the constraints at inference time, we propose to learn the ability to translate with lexical constraints. Let $\mathcal{M}$ be the external memory of constraint pairs $\{(S_i, T_i)\}_{i=1}^{N}$. Here $(S_i, T_i)$ is the $i$-th lexically-constrained pair, in which the source phrase $S_i$ is to be translated to the target phrase $T_i$, and $N$ is the number of constraints. A constraint $S_i$ or $T_i$ may include several words. The intuition behind our algorithm is that the constraints should be integrated into the context representations in the encoder, such that

$$\tilde{\mathbf{h}}_1, \dots, \tilde{\mathbf{h}}_m = f_{enc}\big(X, f_{mem}(X, \mathcal{M})\big)$$

where $X$ is the source sentence of length $m$, $f_{enc}$ is the encoder function, $f_{mem}$ is the attention mechanism over the memory of constraints, and $\tilde{\mathbf{h}}_1, \dots, \tilde{\mathbf{h}}_m$ are the memory-augmented context representations.

When decoding, the constrained translations are generated using the information from the memory-augmented context representations. The training goal is to maximize the following probability with respect to the functions $f_{enc}$, $f_{dec}$, $f_{att}$ and $f_{mem}$:

$$p(Y \mid X, \mathcal{M}) = \prod_{t=1}^{n} p(y_t \mid y_{<t}, X, \mathcal{M})$$
3.1 Architecture

In this paper, the architecture of our lexically-constrained neural machine translation is based on the Transformer (Vaswani et al., 2017), which is one of the most popular NMT models. However, our proposed algorithm can also be implemented with an RNN-based translation model.

The detailed architecture is shown in Figure 1. Compared to the standard transformer architecture, we add one external memory block with an attention loss and one memory-attention layer in the encoder. The other parts of the encoder and the decoder are the same as those of the standard transformer. The external memory block encodes each constraint pair $(S_i, T_i)$ into a key and value pair $(\mathbf{k}_i, \mathbf{v}_i)$. The encoding of constraint pairs can be implemented simply as the average of the word embeddings in the phrase, such that

$$\mathbf{k}_i = \frac{1}{|S_i|} \sum_{w \in S_i} E_s(w), \qquad \mathbf{v}_i = \frac{1}{|T_i|} \sum_{w \in T_i} E_t(w)$$

where $E_s$ and $E_t$ denote the embedding functions of the source language and the target language. $|S_i|$ refers to the length of the source constraint, and $w \in S_i$ means that $w$ is one token in the source constraint $S_i$; $|T_i|$ and $w \in T_i$ are defined analogously for the target constraint. Note that the source constraints and the source inputs share the same embedding function, and so do the target constraints and the target inputs. The constraints could also be encoded with an LSTM or multiple layers of self-attention; however, our empirical results show that the average of word embeddings works well enough. Additionally, averaging word embeddings adds fewer trainable parameters and trains more efficiently than an LSTM or multiple layers of self-attention.
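The averaging step can be illustrated with a small Python sketch. The two-dimensional embedding tables, their values, and the function names below are hypothetical stand-ins; in the model described here the embeddings would be the learned source and target embedding matrices shared with the inputs.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def mean_embedding(tokens: Sequence[str], table: Dict[str, Vector]) -> Vector:
    """Average the embeddings of all tokens in one phrase."""
    dim = len(next(iter(table.values())))
    total = [0.0] * dim
    for token in tokens:
        for d, value in enumerate(table[token]):
            total[d] += value
    return [value / len(tokens) for value in total]


def encode_constraints(pairs: Sequence[Tuple[Sequence[str], Sequence[str]]],
                       src_table: Dict[str, Vector],
                       tgt_table: Dict[str, Vector]) -> Tuple[List[Vector], List[Vector]]:
    """Source phrases become keys, target phrases become values."""
    keys = [mean_embedding(src, src_table) for src, _ in pairs]
    values = [mean_embedding(tgt, tgt_table) for _, tgt in pairs]
    return keys, values


# Hypothetical toy embeddings (real ones are learned and much larger).
src_table = {"sinian": [1.0, 0.0], "ni": [0.0, 1.0]}
tgt_table = {"missed": [2.0, 0.0], "you": [0.0, 4.0]}

keys, values = encode_constraints(
    [(["sinian", "ni"], ["missed", "you"])], src_table, tgt_table)
```

For this single constraint pair the key is the mean of the two source embeddings, `[0.5, 0.5]`, and the value is the mean of the two target embeddings, `[1.0, 2.0]`. No extra parameters are introduced beyond the embedding tables themselves.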

3.2 Automatically Extracted Constraints

As shown in the architecture, training our lexically-constrained NMT requires constraints that correspond to the input and output sequences. Unfortunately, existing NMT training datasets rarely include such constraints because annotating them is time consuming. Consequently, we propose an automatic method for extracting constraint pairs from any bilingual dataset.

Figure 2 shows an example of extracting phrase pairs from bilingual training data; the extracted phrase pairs are then used as constraints for training the lexically-constrained NMT. The extraction process consists of two main parts: alignment and parsing. Sentence alignment is used to extract possible phrase pairs. If several alignments from source to target are consistent in their positions, their combination is viewed as one phrase pair. For example, the alignments (sinian → missed) and (ni → you) should be combined into the phrase pair (sinian ni, missed you). However, some extracted phrase pairs are noisy, that is, misaligned or not meaningful. To avoid such phrase pairs, we filter the phrase pairs extracted from alignment using sentence parsing. Sentence parsing provides more canonical phrases, such as noun phrases and verb phrases. If a phrase extracted from alignment also exists among the candidate phrases from the corresponding parse, it is retained as a final constraint for the current sentence pair.
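The two-stage extraction (alignment, then parse-based filtering) can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: "consistent in positions" is interpreted here with the standard phrase-consistency check from statistical MT, the word alignment and the parser's candidate phrases are passed in as plain Python sets, and the example sentence pair (including the token "wo") is hypothetical.

```python
from typing import List, Set, Tuple


def extract_phrase_pairs(src: List[str], tgt: List[str],
                         alignment: Set[Tuple[int, int]],
                         src_phrases: Set[str], tgt_phrases: Set[str],
                         max_len: int = 4) -> List[Tuple[str, str]]:
    """Return alignment-consistent phrase pairs that the parsers also propose."""
    pairs = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions aligned to the source span [i1, i2].
            tgt_pos = [j for (i, j) in alignment if i1 <= i <= i2]
            if not tgt_pos:
                continue
            j1, j2 = min(tgt_pos), max(tgt_pos)
            # Consistency: no target word in [j1, j2] may align outside [i1, i2].
            if any(j1 <= j <= j2 and not (i1 <= i <= i2) for (i, j) in alignment):
                continue
            s = " ".join(src[i1:i2 + 1])
            t = " ".join(tgt[j1:j2 + 1])
            # Parse filter: keep only phrases the parsers list as constituents.
            if s in src_phrases and t in tgt_phrases:
                pairs.append((s, t))
    return pairs


# Hypothetical example built around the (sinian ni, missed you) pair.
pairs = extract_phrase_pairs(
    src=["wo", "sinian", "ni"],
    tgt=["I", "missed", "you"],
    alignment={(0, 0), (1, 1), (2, 2)},
    src_phrases={"sinian ni"},      # e.g. verb phrases from a source parser
    tgt_phrases={"missed you"},     # e.g. verb phrases from a target parser
)
```

Here the alignment alone would allow several spans, such as "wo sinian" paired with "I missed", but only the span that both parsers list as a phrase survives the filter.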

3.3 Memory-Attention Mechanism

The memory-attention layer added to the encoder block is a multi-head attention layer (Vaswani et al., 2017) that uses the encodings of the inputs as queries and the encoded constraints in the memory as keys and values.

For training, the constraints of all sentences in one batch are encoded into the external memory simultaneously. Therefore, the memory-attention mechanism needs to find, among all constraints in the batch, the constraint most correlated with the inputs. Suppose there are $N+1$ constraints in total, including $N$ normal constraints and one special constraint in the $0$-th slot of the memory. This special constraint provides no information and is used for input tokens that have no related constraint in the memory. Let $\mathbf{K} \in \mathbb{R}^{d \times (N+1)}$ be the matrix of all keys, with each key vector as a column, and let $\mathbf{V} \in \mathbb{R}^{d \times (N+1)}$ be the matrix of all values, where $d$ is the dimension of keys and values. For the query $\mathbf{q}_j$ of the input token $x_j$, the memory-attention can be formulated as a scaled dot-product attention:

$$\mathbf{a}_j = \mathrm{softmax}\!\left(\frac{\mathbf{K}^{\top} \mathbf{q}_j}{\sqrt{d}}\right), \qquad \mathbf{o}_j = \mathbf{V} \mathbf{a}_j$$

Note that $\mathbf{a}_j$ is an $(N+1)$-dimensional probability vector $(a_{j,0}, \dots, a_{j,N})$, which represents the attention weights between $\mathbf{q}_j$ and the columns of $\mathbf{K}$.
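A single-head version of this scaled dot-product memory-attention can be written in plain Python as a sanity check. The sketch makes several simplifying assumptions: it omits the learned query, key and value projections and the multiple heads of the actual layer, it places the special "no constraint" entry at index 0, and all vectors are hypothetical toy values.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def memory_attention(query: Vector, keys: List[Vector],
                     values: List[Vector]) -> Tuple[List[float], Vector]:
    """Attend from one input token to every slot of the constraint memory."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)  # one probability per memory slot
    output = [sum(w * value[i] for w, value in zip(weights, values))
              for i in range(len(values[0]))]
    return weights, output


# Slot 0 is the special empty slot (assumed index); slots 1 and 2 hold constraints.
keys = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
values = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
weights, output = memory_attention([4.0, 0.0], keys, values)
```

Because the query points in the same direction as the key in slot 1, almost all of the attention mass lands on that slot and the output is close to its value vector.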

The goal of the memory-attention mechanism is to attend to the most correlated slot in the memory and to integrate the context information of the target constraints into the corresponding queries. If the number of constraints is large, the memory provides much information about the translated results. Consequently, the model can easily become over-reliant on the memory, and its performance without any constraints will drop below that of the standard transformer. Conversely, the model will ignore the information in the memory if the number of constraints is too small. To avoid both situations, we propose an additional attention loss for the memory-attention mechanism, as illustrated in Figure 1.

For each token $x_j$ in the inputs, we generate a label $l_j$ for memory-attention. This label is the index of the memory slot to which the token should attend. If a token has no corresponding slot in the memory, it should attend to the special slot $0$, so its attention label is $l_j = 0$. The additional attention loss can be formulated as follows:

$$\mathcal{L}_{att} = -\frac{1}{B} \sum_{j=1}^{B} \log a_{j, l_j}$$

where $B$ is the number of tokens in the batch and $a_{j, l_j}$ refers to the attention probability of the $l_j$-th slot for token $x_j$. This loss is minimized together with the main loss at the output of the decoder. With the proposed memory-attention loss, the attention ability can be learned more efficiently and accurately, even with only a small number of constraints.
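The attention loss can be sketched as an average cross-entropy over the memory-attention weights. The function name, the toy attention rows, and the use of index 0 for the special "no constraint" slot are assumptions made for this illustration; how the loss is weighted against the main translation loss is not specified here.

```python
import math
from typing import List


def attention_loss(attn: List[List[float]], labels: List[int]) -> float:
    """Mean negative log-probability of the labelled memory slot per token."""
    total = 0.0
    for weights, label in zip(attn, labels):
        total -= math.log(weights[label])
    return total / len(labels)


# Two tokens, two memory slots: token 0 has no constraint (label 0, the assumed
# special slot), token 1 belongs to the constraint stored in slot 1.
attn = [[0.5, 0.5], [0.25, 0.75]]
labels = [0, 1]
loss = attention_loss(attn, labels)

# A token that puts all of its attention on the labelled slot contributes zero.
perfect = attention_loss([[1.0, 0.0]], [0])
```

Minimizing this quantity pushes each token's attention distribution toward its labelled slot, which is the behaviour the memory-attention layer is supposed to learn.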

Table 1: Statistics of the Chinese→English and English→German test sets with respect to the number of sentences, number of phrases, number of words in all phrases and number of subwords in all phrases from the target language.
Table 2: BLEU score comparison between models with a memory-attention layer added to different encoder blocks. Columns: Block ID, Zh→En, En→De.
Table 3: BLEU score comparison of different methods (Base, DBA, LCNMT) on the Chinese→English test set. Three different ratios of automatically extracted phrases are randomly selected as constraints.
Table 4: BLEU score comparison of different methods (Base, DBA, LCNMT) on the English→German test set. Three different ratios of automatically extracted phrases are randomly selected as constraints.

4 Experiments

This section presents experimental results that demonstrate the effectiveness of our proposed lexically-constrained neural machine translation (LCNMT). Models for two translation directions, Chinese→English and English→German, are trained on the WMT’17 corpora (Bojar et al., 2017). The corpora are first cleaned by filtering noisy bilingual sentences, such as sentence pairs with an abnormal length ratio, sentence pairs whose target language is the same as the source language, and sentence pairs that appear in the corpora multiple times. The English and German corpora are tokenized with the Moses tokenizer (http://www.statmt.org/moses/), and the Chinese corpora are tokenized with the LTP tokenizer (https://github.com/HIT-SCIR/ltp). All tokenized corpora are then segmented into subwords (Sennrich et al., 2015) using 40k merge operations. We implement our algorithm on top of the Transformer from Tensor2Tensor (https://github.com/tensorflow/tensor2tensor). All experiments are run on 4 GPUs using a base model with a batch size of 9000 tokens, and BLEU scores are evaluated on detokenized results using SacreBLEU (Post, 2018). Because the efficiency of GBS (Hokamp and Liu, 2017) is low and DBA (Post and Vilar, 2018) performs comparably to GBS, our LCNMT is compared with two methods: the base transformer model (Vaswani et al., 2017) and DBA (Post and Vilar, 2018).
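The corpus cleaning step can be sketched as a simple filter. The thresholds and the function name are hypothetical, since no concrete values are given here; the sketch assumes whitespace-tokenized text, and it approximates the same-language check by exact equality of source and target, where a real pipeline would more likely use a language identifier.

```python
from typing import List, Tuple


def filter_corpus(pairs: List[Tuple[str, str]],
                  max_ratio: float = 3.0) -> List[Tuple[str, str]]:
    """Drop noisy sentence pairs; `max_ratio` is an assumed threshold."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        src_len, tgt_len = len(src.split()), len(tgt.split())
        if src_len == 0 or tgt_len == 0:
            continue  # empty side
        if max(src_len, tgt_len) / min(src_len, tgt_len) > max_ratio:
            continue  # abnormal length ratio
        if src.strip() == tgt.strip():
            continue  # target is a copy of the source (proxy for same language)
        if (src, tgt) in seen:
            continue  # duplicate pair
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept


corpus = [
    ("wo sinian ni", "I missed you"),
    ("wo sinian ni", "I missed you"),            # duplicate
    ("hello world", "hello world"),              # untranslated copy
    ("ni", "you are the one I have been missing"),  # length ratio 8
]
clean = filter_corpus(corpus)
```

Only the first pair survives in this toy corpus; each of the other three is removed by one of the rules listed above.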

4.1 Setup of Memory-Attention

To keep the training efficiency close to that of the standard transformer, only one of the encoder blocks is given a memory-attention layer. To determine which encoder block should attend to the memory, we first train models in which the memory-attention layer is added to different encoder blocks.

We test our Chinese→English models on newstest2017 and our English→German models on newstest2014, and refer to them as the Chinese→English and English→German test sets. Table 1 shows statistics of the two test sets with respect to the number of sentences, number of extracted phrases, number of words in all phrases and number of subwords in all phrases from the target language. The phrases are extracted as discussed in section 3.2. From these statistics, the average number of subword constraints per sentence can be computed for each test set. Note that some sentences may have no phrases because no accurate alignments are found by alignment and parsing.

We randomly select a fixed proportion of the extracted phrases as constraints and compare the performance of models with the memory-attention layer added to different blocks. The experimental results are shown in Table 2. From the results, we conclude that adding the memory-attention layer to the second block of the encoder is the best choice.

Source: We don’t have to rush into surgery that is irreversible.
Constraints: sich nicht mehr rückgängig machen lässt.
Reference: Und man muss nicht übereilt eine Operation vornehmen, die sich nicht mehr rückgängig machen lässt.
Base: Wir müssen uns nicht in eine Operation stürzen, die unumkehrbar ist.
DBA: Wir müssen nicht in eine irreversible Operation voreilig greifen. sich nicht mehr rückgängig machen läss. & # 160; & # 160; & # 160; & # 160;
LCNMT: Wir müssen uns nicht in eine Operation stürzen, die sich nicht mehr rückgängig machen
Table 5: An example comparing the results of different methods on the English→German task.

4.2 Performance on WMT

Following the above analysis, the memory-attention layer is added to the second encoder block. We evaluate all methods on both the Chinese→English and English→German test sets. To compare performance with varying numbers of constraints, we randomly select different ratios of the phrases as constraints for decoding. For all methods, a beam size of 12 is used for decoding, which is sufficient for beam search.

Table 3 and Table 4 show the performance comparison of all methods in terms of BLEU scores on the Chinese→English and English→German test sets, respectively. For the English→German task, all lexically-constrained methods outperform the baseline, and our proposed LCNMT performs best across the different ratios of constraints. For the Chinese→English task, our LCNMT performs comparably to DBA. Our LCNMT is a soft lexically-constrained translation method: it generates the constrained results with high probability while simultaneously preserving the fluency of the output. An example is given in Table 5. All methods except the baseline generate results containing the given constraints. However, the output of DBA is less fluent and contains some unexpected generations.

5 Related Work

Building an efficient and effective machine translation system has been an attractive goal for decades. Although systems based on statistical machine translation (Callison-Burch and Koehn, 2005) have been used in practice, their limited performance has hindered wider adoption. Recent works on neural machine translation (Cho et al., 2014; Gehring et al., 2016; Bahdanau et al., 2014; Vaswani et al., 2017; Lample et al., 2017, 2018) have changed this. Bahdanau et al. (2014) proposed an attention mechanism for encoder-decoder neural machine translation systems, which can better exploit the context representations of the source sentence. The Transformer (Vaswani et al., 2017) is a more promising neural machine translation architecture based on self-attention, which achieves faster training and better performance.

Several works (Anderson et al., 2016; Hokamp and Liu, 2017; Post and Vilar, 2018) have discussed lexically-constrained beam search from different aspects. Anderson et al. (2016) apply constrained beam search to image captioning in order to handle out-of-domain scenes or objects. A finite-state machine whose states represent the completed constraints is combined with beam search. However, the decoding complexity is exponential in the number of constraints. An improved Grid Beam Search (GBS) method is proposed by Hokamp and Liu (2017), which extends each individual beam by the number of constraints to exhaustively search for results with completed constraints. The time complexity of GBS is linear in the number of constraints. Additionally, a parallel implementation of GBS is troublesome because the beam size varies with the number of constraints of each sentence. Post and Vilar (2018) make a significant improvement over GBS with dynamic beam allocation, which reduces the time complexity to $O(1)$ in the number of constraints. Unfortunately, the beam size is required to be much larger than the number of constraints, and this hard beam search can sometimes generate strange results.

External memory has been used in several works (Zhao et al., 2018; Meng et al., 2018; Feng et al., 2017) to enhance the quality of neural machine translation. For example, Zhao et al. (2018) propose to extract a phrase table as a recommendation memory for neural machine translation. However, this kind of phrase table is too noisy, as also mentioned by Post and Vilar (2018). Feng et al. (2017) propose to store hidden context information in the memory, which can be used to calculate an additional probability of the target word. Both of these methods require high-quality translation alignment. Pham et al. (2018) propose to annotate the source sentences with expert suggestions and use a copy-generator for rare word translation. However, the strong copy ability may cause a loss of fluency. Finally, Meng et al. (2018) aim to improve the performance of NMT by maintaining an updatable memory.

6 Conclusions

In this paper, we propose an algorithm for training lexically-constrained translation with an external memory. Compared with DBA, our method decodes more efficiently using a soft lexically-constrained memory. To make our method easier to apply, we propose a procedure for automatically extracting phrases, which can provide constraints for any bilingual corpus. A memory-attention loss is utilized to enforce accurate memory-attention even with a small number of constraints. Experimental results demonstrate the effectiveness of our LCNMT.