Neural machine translation [bahdanau2015neural, sutskever2014sequence, vaswani2017attention, gehring2017convolutional] captures the knowledge of the source and target language along with their correspondences as part of the encoder and decoder parameters learned from data. With this embedded and parameterized knowledge, a trained NMT model is able to translate a new source sentence into the target language.
In this paper, we consider a translation scenario that differs from standard NMT. In this scenario, in addition to a given source sentence, NMT is also provided with an example translation that contains reusable translation segments for the source sentence. The NMT model can either use the knowledge embedded in its parameters or learn from the example translation on the fly to predict target words. This translation scenario is not new to machine translation, as it has been studied in example-based machine translation [nagao1984framework] and in the combination of statistical machine translation (SMT) with translation memory [koehn2010convergence]. However, in the context of NMT, the incorporation of external symbolic translations is still an open problem. We therefore propose example-guided NMT (EGNMT) to seamlessly integrate example translations into NMT.
Unlike conventional machine translation formalisms, a trained NMT model cannot easily be adapted to an example translation on the fly, as the model is less transparent and less amenable to modification than SMT models. To address this issue, we use a new encoder (hereafter the example encoder) to encode the example translation in EGNMT, in addition to the primary encoder for the source sentence.
As the example is not identical to the source sentence, only parts of the example translation can be used in the final translation for the source sentence. Hence the challenge is to teach EGNMT to detect and use matched translation fragments while ignoring unmatched noisy parts.
To handle this challenge, we propose two models that guide the decoder to reuse translations from examples. The first model is a noise-masked encoder model (NME). In the example encoder, we pinpoint unmatched noisy fragments in each example translation via word alignments and mask them out with a symbol “”. The noise-masked example translation is then input to the example encoder. This model mimics human translators in paying special attention to reusable parts and ignoring those unrelated parts when an example translation is given.
Different from NME that encodes the noise-masked example translation, in the second model, we directly produce a masked translation from the example translation with an auxiliary decoder (hence the auxiliary decoder model, AD). We compare the reference translation of a source sentence in the training data with its corresponding example translation. The identical parts in the reference translation are retained while other parts are substituted with the symbol “”. The auxiliary decoder is then used to predict the masked reference translation. It is jointly trained with the primary decoder and shares its parameters with the primary decoder. Therefore the primary decoder can learn from the auxiliary decoder to predict reusable words/phrases from the example translation. Notice that the auxiliary decoder is only used during the joint training phase.
In summary, our contributions are threefold.
We propose an example-guided NMT framework to learn to reuse translations from examples.
In this framework, we further propose two models: NME that encodes reusable translations in the example encoder and AD that teaches the primary decoder to directly predict reusable translations with the auxiliary decoder via parameter sharing and joint training.
The proposed EGNMT framework can be applied to any encoder-decoder based NMT model. In this paper, we define EGNMT over the state-of-the-art NMT architecture Transformer [vaswani2017attention] and evaluate EGNMT on Chinese-English, English-German and English-Spanish translation. In our experiments, the best EGNMT model achieves improvements of 4-9 BLEU points over the baseline on the three language pairs. Analyses show that the proposed model can effectively learn from example translations with different similarity scores.
2 Related Work
Translation Memory Our work is related to the studies that combine translation memory (TM) with machine translation. Various approaches have been proposed for the combination of TM and SMT. For example, Koehn and Senellart [koehn2010convergence] propose to reuse matched segments from TM for SMT. In NMT, Gu et al. [gu2017search] propose to encode sentences from TM into vectors, which are then stored as key-value pairs to be explored by NMT. Cao and Xiong [cao2018encoding] regard the incorporation of TM into NMT as a multi-input problem and use a gating mechanism to combine them. Bapna and Firat [bapna2019non] integrate multiple similar examples into NMT and explore different retrieval strategies. Different from these methods, we propose more fine-grained approaches to dealing with noise in matched translations.
Example-based MT In the last century, many studies focused on the impact of examples on translation, or translation by analogy [nagao1984framework, somers1999example]. Wu [wu2005mt] discusses the relations among statistical, example-based and compositional MT in a three-dimensional model space, motivated by their interplay. Our work can be considered as a small step in this space to integrate the example-based translation philosophy with NMT.
Using examples in neural models for other tasks
In other areas of natural language processing, many researchers are interested in combining symbolic examples with neural models. Pandey et al. [pandey2018exemplar] propose a conversational model that learns to utilize similar examples to generate responses. The retrieved examples are used to create exemplar vectors that are used by the decoder to generate responses. Cai et al. [cai2018skeleton] also introduce examples into dialogue systems, but they first generate a skeleton based on the retrieved example, and then use the skeleton as an additional knowledge source for response generation. Guu et al. [guu2018generating] present a new generative language model for sentences that first samples a prototype sentence and then edits it into a new sentence.
External knowledge for NMT Our work is also related to previous works that incorporate external knowledge or information into NMT. Zhou et al. [zhou2017neural] propose to integrate the outputs of SMT to improve the translation quality of NMT while Wang et al. [wang2017neural] explore SMT recommendations in NMT. Zhang et al. [zhang2018guiding] incorporate translation pieces into NMT within beam search. In document translation, many efforts try to encode the global context information with the aid of discourse-level approaches [kuang2018fusing, zhang2018improving, voita2018context]. In addition to these, some studies integrate external dictionaries into NMT [arthur2016incorporating, li2016towards] or force the NMT decoder to use given words/phrases in target translations [hokamp2017lexically, post2018fast, hasler2018neural].
Multi-task learning The way that we use the auxiliary decoder and share parameters is similar to multi-task learning in NMT. To name a few, Dong et al. [dong2015multi] share an encoder among different translation tasks. Weng et al. [weng2017neural] add a word prediction task in the process of translation. Sachan and Neubig [sachan2018parameter] explore parameter sharing strategies for the task of multilingual machine translation. Wang et al. [wang2018learning] propose to jointly learn to translate and predict dropped pronouns.
3 Guiding NMT with Examples
The task here is to translate a source sentence into the target language from the representations of the sentence itself and a matched example translation. In this section, we first introduce how example translations are retrieved and then briefly describe the basic EGNMT model, which uses one encoder for source sentences and another for retrieved example translations. Based on this simple model, we elaborate on the two proposed models: the noise-masked encoder model and the auxiliary decoder model.
3.1 Example Retrieval
Given a source sentence to be translated, we find a matched example from an example database of source-target pairs. The source part of the matched example has the highest similarity score to the source sentence among all examples in the database. A variety of metrics can be used to estimate this similarity score. In this paper, we first retrieve the top n example translations with an off-the-shelf search engine, and then calculate the cosine similarity between their sentence embeddings and that of the source sentence, selecting the example with the highest similarity as the matched example. Details are given in the experiment section. Later, to make the similarity between the matched example and the source sentence easier to interpret, we also use the Fuzzy Match Score (FMS) [koehn2010convergence] as a measurement, which is computed from the edit distance between the two source sentences.
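A common formulation of the fuzzy match score is $\mathrm{FMS}(x, x^m) = 1 - \frac{\mathrm{ED}(x, x^m)}{\max(|x|, |x^m|)}$, where ED is the edit distance. The sketch below is a minimal illustration of that formulation, assuming a word-level Levenshtein distance over whitespace-separated tokens; the symbols $x$ and $x^m$ (source sentence and source side of the matched example) and the function names are introduced here for illustration only.

```python
def levenshtein(a, b):
    """Word-level edit distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def fuzzy_match_score(source, example_source):
    """FMS = 1 - edit_distance / max(length); assumes whitespace tokenization."""
    src, ex = source.split(), example_source.split()
    longest = max(len(src), len(ex))
    if longest == 0:
        return 1.0  # two empty sentences are treated as identical
    return 1.0 - levenshtein(src, ex) / longest
```

With this definition, two sentences of four words that differ in a single word receive a score of 0.75, and identical sentences receive 1.0.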
3.2 Basic Model
Figure 1 shows the architecture of the basic model built upon the Transformer. We use two encoders: the primary encoder for encoding the source sentence and the example encoder for the matched example translation. The primary encoder is constructed following Vaswani et al. [vaswani2017attention].
The example encoder contains three sub-layers: a multi-head example self-attention layer, a multi-head source-example attention layer and a feed-forward network layer. Each sublayer is followed by a residual connection and layer normalization.
Before we describe these three sublayers in the example encoder, we first define the embedding layer. Each word of the matched example translation is represented as the sum of its word embedding and its positional encoding, where PE denotes the positional encoding function.
The first sub-layer is a multi-head self-attention layer over these embeddings.
The second sub-layer is a multi-head source-example attention layer that attends to the output of the primary encoder. This sublayer is responsible for the attention between the matched example translation and the source sentence. The third sub-layer is a feed-forward network.
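The embedding layer and the three sub-layers described above can be sketched as follows. This is one way to write them, not necessarily the original notation: the symbols $y^m$ (example translation of length $L$), $E$, $A$, $B$, $C$ and $H^{src}$ (output of the primary encoder) are introduced here for illustration, $\mathrm{MultiHead}(Q, K, V)$ follows the argument order of Vaswani et al. [vaswani2017attention], and the residual connection and layer normalization are placed after each sub-layer as stated above.

```latex
\begin{aligned}
E_j &= \mathrm{Emb}(y^m_j) + \mathrm{PE}(j), \qquad j = 1, \dots, L \\
A   &= \mathrm{LayerNorm}\big(E + \mathrm{MultiHead}(E, E, E)\big) \\
B   &= \mathrm{LayerNorm}\big(A + \mathrm{MultiHead}(A, H^{src}, H^{src})\big) \\
C   &= \mathrm{LayerNorm}\big(B + \mathrm{FFN}(B)\big)
\end{aligned}
```

Here $C$ plays the role of the example encoder output that the decoder later attends to.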
Unlike the primary encoder, which has 6 layers, the example encoder has only a single layer. In our preliminary experiments, we find that a deep example encoder is not better than a single-layer shallow encoder. This may be explained by recent findings that higher-level representations in the encoder capture semantics while lower-level states model syntax [peters2018deep, anastasopoulos2018tied, dou2018exploiting]. As the task is to borrow reusable fragments from the example translation, we do not need to fully understand the entire example translation. We conjecture that a full semantic representation of the example translation may even interfere with the primary encoder in conveying the meaning of the source sentence to the decoder.
In the decoder, different from Vaswani et al. [vaswani2017attention], we insert an additional sub-layer between the masked multi-head self-attention and encoder-decoder attention. The additional sub-layer lets the decoder attend to the example translation representation: it takes the output of the masked multi-head self-attention as queries and the output of the example encoder as keys and values. This sub-layer is also followed by a residual connection and layer normalization.
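As a hedged sketch of where the additional sub-layer sits in one decoder layer, the following uses symbols introduced here for illustration: $S$ is the output of the masked multi-head self-attention, $C$ the output of the example encoder, and $H^{src}$ the output of the primary encoder. The exact form is an assumption based on the description above.

```latex
\begin{aligned}
T &= \mathrm{LayerNorm}\big(S + \mathrm{MultiHead}(S, C, C)\big) && \text{(decoder-to-example attention)} \\
U &= \mathrm{LayerNorm}\big(T + \mathrm{MultiHead}(T, H^{src}, H^{src})\big) && \text{(encoder-decoder attention)} \\
O &= \mathrm{LayerNorm}\big(U + \mathrm{FFN}(U)\big) && \text{(feed-forward network)}
\end{aligned}
```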
3.3 Noise-Masked Encoder Model
As the source part of the matched example is not identical to the source sentence, parts of the example translation cannot be reused in producing the target translation for the source sentence. These unmatched parts may act as noisy signals that disturb the translation process of the decoder. To prevent these unmatched parts from interfering with target prediction, we propose a noise-masked encoder to encode the example translation. The idea behind this new encoder is simple. We detect the unmatched parts in the example translation and use a symbol “” to replace them so as to mask out their effect on translation. The masking process can be defined as a function that maps the example translation to its noise-masked version.
The masking function can be illustrated with the example shown in Figure 2. Comparing the source side of the matched example with the source sentence, we can find repeated source words. Keeping the repeated words and replacing other words with “”, we obtain the masked version of the example source. Then, we use a pre-trained word alignment model to obtain word alignments between the source and target sides of the matched example. We replace words in the example translation that are aligned to the masked parts of the example source with “”. In this way, we finally obtain the masked example translation, in which only reusable parts are retained.
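The two masking steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the original implementation: `MASK` is a hypothetical placeholder string, "repeated words" are detected by simple set membership, unaligned target words are retained, and a target word is masked if it is aligned to at least one masked source word. Alignments are assumed to be given as `i-j` index pairs (source index, target index), the format that fast_align writes.

```python
MASK = "<MASK>"  # hypothetical placeholder for the masking symbol


def parse_alignments(line):
    """Parse 'i-j' pairs (source index, target index), e.g. '0-0 1-2'."""
    return [tuple(int(k) for k in pair.split("-")) for pair in line.split()]


def mask_example_source(source_tokens, example_source_tokens):
    """Keep example-source words that also occur in the source sentence."""
    source_vocab = set(source_tokens)
    return [tok if tok in source_vocab else MASK
            for tok in example_source_tokens]


def mask_example_target(example_target_tokens, masked_example_source, alignments):
    """Mask target words aligned to at least one masked example-source word."""
    noisy = {t for s, t in alignments if masked_example_source[s] == MASK}
    return [MASK if j in noisy else tok
            for j, tok in enumerate(example_target_tokens)]


# Toy usage with made-up sentences and a made-up one-to-one alignment.
source = "the cat sat on the mat".split()
example_source = "the dog sat on the mat".split()
example_target = "der Hund sitzt auf der Matte".split()
alignments = parse_alignments("0-0 1-1 2-2 3-3 4-4 5-5")

masked_source = mask_example_source(source, example_source)
masked_target = mask_example_target(example_target, masked_source, alignments)
```

In this toy case only "dog" is unmatched on the source side, so only its aligned target word is masked.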
This masking method is based on word alignments. In practice, inaccurate word alignments will cause reusable words to be filtered out and noisy words to be retained. To minimize the negative impact of wrong word alignments, we employ a standard Transformer encoder module to encode the original example translation.
Hence the differences between the example encoder in the basic model and that in the NME model are twofold: (1) we replace the input example translation with its noise-masked version; (2) we add a sub-layer between the multi-head self-attention and source-example attention, which takes the output of the multi-head self-attention and attends to the encoded original example translation. The architecture can be seen in Figure 3.
3.4 Auxiliary Decoder Model
In order to better leverage useful information in original example translations, we further propose an auxiliary decoder model. In this model, we directly compare the example translation with the target translation. We can easily detect translation fragments that occur in both the example and the real target translation. Similarly, we mask out other words to get a masked version of the target translation (see the last row in Figure 2).
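One possible way to build the masked target translation is sketched below. The text does not specify how shared fragments are detected, so this sketch assumes an order-preserving match computed with Python's `difflib.SequenceMatcher`; `MASK` and `mask_target` are hypothetical names introduced for illustration.

```python
from difflib import SequenceMatcher

MASK = "<MASK>"  # hypothetical placeholder for the masking symbol


def mask_target(target_tokens, example_target_tokens):
    """Retain target words inside fragments shared with the example translation."""
    matcher = SequenceMatcher(a=target_tokens, b=example_target_tokens,
                              autojunk=False)
    keep = set()
    for block in matcher.get_matching_blocks():
        # block.a is the start index in target_tokens, block.size the length
        keep.update(range(block.a, block.a + block.size))
    return [tok if i in keep else MASK for i, tok in enumerate(target_tokens)]
```

A bag-of-words comparison (keeping any target word that occurs anywhere in the example) would be a simpler alternative; the sequence-based variant above only retains words that match in the same relative order.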
As the gold target translation is only available during the training phase, we employ an auxiliary decoder in the new model, which is shown in Figure 4. The purpose of the auxiliary decoder is to predict the masked target translation during the training phase from the example translation and the source sentence.
For this, we need to train an auxiliary NMT system on training instances whose targets are the masked target translations, while the primary NMT system is trained on instances with the original target translations. We jointly train these two systems to minimize a joint loss that combines the loss of the primary NMT system with the loss of the auxiliary NMT system.
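One way to write such a joint objective is given below, assuming an unweighted sum of two negative log-likelihood terms over shared parameters $\theta$. The symbols are introduced here for illustration: $x$ is the source sentence, $y^m$ the example translation, $y$ the target translation and $\tilde{y}$ the masked target translation.

```latex
\mathcal{L}(\theta) =
  \underbrace{-\log P(y \mid x, y^m; \theta)}_{\text{primary NMT system}}
  \; + \;
  \underbrace{-\log P(\tilde{y} \mid x, y^m; \theta)}_{\text{auxiliary NMT system}}
```

A weighting coefficient between the two terms would be a natural variation; none is stated in the text.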
During the testing phase, the auxiliary decoder is removed. We therefore share the parameters of the auxiliary decoder with the primary decoder. This is important as it allows the primary decoder to learn from the auxiliary decoder in the training phase to generate reusable parts. The joint training makes the primary decoder pay more attention to the reusable parts in the example translation by adjusting the parameters of the attention network between the example encoder and the primary decoder in the right direction.
3.5 Assembling NME and AD
The noise-masked encoder model and the auxiliary decoder model can be combined. In this combination, we not only mask out noisy parts of example translations in the encoder but also use the masked example translation to predict the masked target translation in the auxiliary decoder.
We conducted experiments on Chinese-English, English-German and English-Spanish translation to evaluate the proposed models for EGNMT.
4.1 Experimental Settings
We implemented our example-guided NMT systems in TensorFlow. We obtained word alignments with the tool fast-align (available at https://github.com/clab/fast_align). The maximum length of training sentences is set to 50 for all languages. We applied byte pair encoding [sennrich2016neural] with 30k merge operations. We trained all models with the Adam stochastic optimization algorithm [kingma2015adam]. We set the beam size to 4 during decoding. We used two GPUs for training and one for decoding. We used case-insensitive 4-gram BLEU as our evaluation metric [papineni2002bleu] and the script “multi-bleu.perl” to compute BLEU scores.
For the Chinese-English corpus, we used the United Nations Parallel Corpus [rafalovitch2009united] from Cao and Xiong [cao2018encoding], which consists of official records and other parliamentary documents. The numbers of sentences in the training/development/test sets are 1.1M/804/1,614.
We also evaluated our methods on English-German and English-Spanish translation. We used the JRC-Acquis corpus (available at https://ec.europa.eu/jrc/en/language-technologies/jrc-acquis) following previous works [koehn2010convergence, gu2017search, bapna2019non]. We randomly selected sentences from the corpus to build the training/development/test sets. The numbers of sentences in the training/development/test sets are 0.5M/676/1,824 for English-German and 0.8M/900/2,795 for English-Spanish. We used the training sets as the example database. We first used Lucene (available at http://lucene.apache.org/) to retrieve the top 10 example translations from the example database, excluding the sentence itself. Then we obtained the sentence embeddings of these retrieved examples with the fasttext tool (available at https://fasttext.cc/) and calculated the cosine similarity between the source sentence and each retrieved example. Finally, we selected the example with the highest similarity score as the matched example.
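The re-ranking step of this retrieval pipeline can be sketched as follows. The sketch assumes that the search engine has already returned a candidate list and that sentence embeddings have been precomputed as plain lists of floats; it does not call Lucene or fastText, and `select_matched_example` and the tuple layout are hypothetical names chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def select_matched_example(source_embedding, candidates):
    """Pick the candidate whose embedding is closest to the source sentence.

    candidates: list of (example_source, example_target, embedding) tuples,
    e.g. the top 10 hits returned by the search engine.
    """
    return max(candidates, key=lambda c: cosine(source_embedding, c[2]))


# Toy usage with made-up two-dimensional embeddings.
candidates = [
    ("example source 1", "example target 1", [0.0, 1.0]),
    ("example source 2", "example target 2", [2.0, 0.1]),
]
best = select_matched_example([1.0, 0.0], candidates)
```

Because cosine similarity ignores vector length, the second candidate is selected here even though its embedding has a larger norm than the query.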
4.2 Chinese-English Results
Table 1 shows the results. In the table, we divide the test set into 9 groups according to the FMS values of matched example translations and show BLEU scores on each group and the entire set. We show the BLEU scores for both the baseline and matched example translations against reference translations for comparison. Additionally, we adapted the gated method proposed by Cao and Xiong [cao2018encoding] to the Transformer and compared our models with this gated Transformer model. The results of this experiment are also reported in Table 1. From the table, we can observe the following.
The basic model obtains an improvement of 2.78 BLEU points over the baseline. This demonstrates the advantage of example-guided NMT: teaching NMT to learn from example translations on the fly is better than mixing examples as training data. We also find that the basic model can improve translation quality only when FMS is larger than 0.5, indicating that it suffers from noise in low-FMS example translations.
The noise-masked encoder model is better than the basic model by 0.27 BLEU points. The model significantly improves translation quality for sentences with low-FMS example translations, which means that masking noise is helpful. But it also slightly hurts translation quality for high-FMS (e.g., 0.5) sentences compared with the basic model. This may be because the noisy parts are much more dominant than the reusable parts in example translations with low FMS, which makes it easier to detect and mask out noisy parts via word alignments. However, in high-FMS example translations, many words can be reused, with a few unmatched words scattered among them. Detecting and masking noisy words with inaccurate word alignments therefore risks masking reusable words as well. Although we also attend to the original example translation, reusable words that are masked by mistake may still not be recovered.
The auxiliary decoder model substantially improves performance, by more than 5.68 BLEU points over the basic model. It significantly improves translation quality for high-FMS sentences by learning to reuse previously translated segments separated by scattered unmatched words. However, in low-FMS intervals, its performance is still not satisfactory because it may not distinguish the unmatched parts accurately.
Assembling the noise-masked encoder and auxiliary decoder models together, we achieve the best performance, 7.09 BLEU points higher than the basic model and 3.01 BLEU points higher than the previous gated Transformer model [cao2018encoding]. We can improve translation quality for both high-FMS and low-FMS sentences. On the one hand, the NME model masks the noisy information in the example. On the other hand, the AD model teaches the decoder to use the useful information. The AD model can also guide the NME model in attending to the original example.
4.3 Results for English-German and English-Spanish Translation
We further conducted experiments on the English-German and English-Spanish corpora. Results are shown in Tables 2 and 3. We have similar findings to those on Chinese-English translation. Our best model achieves improvements of over 4 BLEU points over the basic EGNMT model. The improvements in these two language pairs are not as large as those in Chinese-English translation. The reason may be that the retrieved examples are not as similar to German/Spanish translations as those to English translations in the Chinese-English corpus. This can be verified by the BLEU scores of matched example translations in the Chinese-English, English-German and English-Spanish corpora, which are 47.32/36.51/38.72 respectively. The more similar the matched example translations are to the target translations, the larger the improvements our model can achieve.
We look into translations generated by the proposed EGNMT models to analyze how example translations improve translation quality in this section.
5.1 Analysis on the Generation of Reusable Words
|source|feizhoudalu de wuzhuangchongtu , genyuan daduo yu pinkun ji qianfada youguan .|
|reference|most armed conflicts on the african continent are rooted in poverty and under-development .|
|example source|youguan feizhou guojia de wuzhuangchongtu , jiu@@ qi@@ genyuan daduo yu pinkun he qianfada youguan .|
|example translation|most armed conflicts in and among african countries are rooted in poverty and lack of development .|
|Transformer|most of the armed conflicts on the continent are related to poverty and the less developed countries .|
|Basic Model|most armed conflicts in the african continent are related to poverty and lack of development .|
|Final model|most armed conflicts in the african continent are rooted in poverty and lack of development .|
We first compared matched example translations against reference translations in the Chinese-English test set at the word level after all stop words are removed. Table 4 shows the number of matched and unmatched noisy words in example translations. The noise-masking procedure can significantly reduce the number of noisy words (9,353 vs. 1,627). 8.1% of matched words in the original example translations are filtered out due to wrong word alignments.
We collected a set of reusable words that are present in both example and reference translations (all stop words removed). Similarly, we obtained a set of words that occur in both example and system translations. The words in the latter set can be regarded as words generated by EGNMT models under the (positive or negative) guidance of example translations. The intersection of the two sets is the set of words that are correctly reused from example translations by EGNMT models. We computed an F metric for reusable word generation based on these sets.
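The computation can be sketched as follows, assuming the F metric is the balanced harmonic mean of precision and recall over word types: precision is the share of example words produced by the system that are also in the reference, and recall is the share of reusable words that the system produced. The function name and the `stop_words` parameter are hypothetical.

```python
def reuse_f_score(example, reference, system, stop_words=frozenset()):
    """Balanced F score for reusing example words (token lists as input)."""
    example_words = set(example) - set(stop_words)
    reusable = example_words & set(reference)  # in example and reference
    generated = example_words & set(system)    # in example and system output
    correct = reusable & generated             # correctly reused words
    if not correct:
        return 0.0
    precision = len(correct) / len(generated)
    recall = len(correct) / len(reusable)
    return 2 * precision * recall / (precision + recall)
```

For instance, if two example words are reusable and the system copies two example words of which one is reusable, both precision and recall are 0.5 and so is the F score.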
Figure 5 shows the F scores for different EGNMT models. It can be seen that the proposed EGNMT models are capable of enabling the decoder to generate matched words from example translations while filtering out noisy words.
The auxiliary decoder model achieves the lowest F for low-FMS sentences because the model reuses a lot of noisy words from low-FMS example translations (hence the precision is low). This indicates that low-FMS example translations have a negative impact on the AD model. The NME model is able to achieve a high precision by masking out noisy words, but it has a low recall for high-FMS examples because it incorrectly filters out reusable words. Combining the strengths of the two models, we can achieve high F scores for both low- and high-FMS examples, as shown in Figure 5 (the final model).
5.2 Attention Visualization and Analysis
Table 5 provides a sample from the Chinese-English test set. We can see that the example translation provides two fragments that are better than the target translation generated by the baseline model. The fragment “most armed conflicts” is successfully reused by the basic model, but the fragment “are rooted in poverty” does not appear in the target translation generated by the basic model. In contrast to the two models, our final model successfully reuses the two fragments.
We further visualize and analyze attention weights between the example translation and system translation (the example encoder vs. the primary decoder). The visualization of attention weights for this sample is shown in Figure 6. The basic EGNMT model can use only a few reusable words, as its attention weights are scattered over the entire example translation rather than concentrated on reusable words. The final EGNMT system that uses both the noise-masked encoder and auxiliary decoder model, by contrast, correctly detects all reusable words and enables the decoder to pay more attention to these reusable words than to other words.
6 Conclusions and Future Work
In this paper, we have presented EGNMT, a general and effective framework that enables the decoder to detect reusable translation fragments in matched example translations and reuse them in generated target translations. The noise-masking technique is introduced to filter out noisy words in example translations. The noise-masked encoder and auxiliary decoder models are proposed to learn reusable translations from low- and high-FMS example translations. Both experiments and analyses demonstrate the effectiveness of EGNMT and its advantage over mixing example translations with training data.
It is natural to use the proposed EGNMT to combine NMT with translation memory. Although not explored, the EGNMT framework can also be used to adapt an NMT system to a domain with very little in-domain data by treating the in-domain data as the example database. In addition to these, there are still many open problems related to the use and abstraction of examples in neural machine translation that have not been addressed in this paper. In particular, our work can be extended in the following two ways.
Example-based NMT with multiple examples. We only use the best example with the highest similarity score. However, we can find many examples with reusable fragments. A seamless combination of example-based translation philosophy with NMT is necessary for example-based NMT to benefit from multiple examples.
Integration of translation rules or templates into NMT. The noise-masked example in this paper can be considered as a template abstracted from a single example. If we learn translation templates or rules from multiple examples or the entire training data, the noise-masked encoder model or the auxiliary decoder model might be adjusted to incorporate them into NMT.