Bridging Neural Machine Translation and Bilingual Dictionaries

10/24/2016 ∙ by Jiajun Zhang, et al.

Neural Machine Translation (NMT) has become the new state-of-the-art in several language pairs. However, it remains a challenging problem how to integrate NMT with a bilingual dictionary which mainly contains words rarely or never seen in the bilingual training data. In this paper, we propose two methods to bridge NMT and the bilingual dictionaries. The core idea is to design novel models that transform the bilingual dictionaries into adequate sentence pairs, so that NMT can distil latent bilingual mappings from the ample and repetitive phenomena. One method leverages a mixed word/character model and the other synthesizes parallel sentences that guarantee massive occurrence of the translation lexicon. Extensive experiments demonstrate that the proposed methods can remarkably improve the translation quality, and most of the rare words in the test sentences can obtain correct translations if they are covered by the dictionary.




1 Introduction

Figure 1: The framework of our proposed methods.

Due to its superior ability in modelling the end-to-end translation process, neural machine translation (NMT), recently proposed by [Kalchbrenner and Blunsom2013, Cho et al.2014, Sutskever et al.2014], has become a new paradigm and achieved state-of-the-art translation performance for several language pairs, such as English-to-French, English-to-German and Chinese-to-English [Sutskever et al.2014, Bahdanau et al.2014, Luong et al.2015b, Sennrich et al.2015b, Wu et al.2016].

Typically, NMT adopts the encoder-decoder architecture which consists of two recurrent neural networks. The encoder network models the semantics of the source sentence and transforms the source sentence into the context vector representation, from which the decoder network generates the target translation word by word.

One important feature of NMT is that each word in the vocabulary is mapped into a low-dimensional real-valued vector (word embedding). The use of continuous representations enables NMT to learn latent bilingual mappings for accurate translation and to explore the statistical similarity between words (e.g. desk and table) as well. As a drawback shared with other statistical models, NMT can learn good word embeddings and accurate bilingual mappings only when the words occur frequently in the parallel sentence pairs. However, low-frequency words are ubiquitous, especially when the training data is insufficient (e.g. low-resource language pairs). Fortunately, in many language pairs and domains, we have handmade bilingual dictionaries which mainly contain words rarely or never seen in the training corpus. How to bridge NMT and the bilingual dictionaries therefore remains a big challenge.

Recently, Arthur et al. (2016) attempt to incorporate discrete translation lexicons into NMT. The main idea of their method is to leverage the discrete translation lexicons to positively influence the probability distribution of the output words in the NMT softmax layer. However, their approach only addresses the translation lexicons which are in the restricted vocabulary of NMT (NMT usually keeps only the words whose occurrence is more than a threshold, e.g. 10, since very rare words cannot yield good embeddings and a large vocabulary leads to high computational complexity). The out-of-vocabulary (OOV) words are not considered.

In this paper, we aim at making full use of all the bilingual dictionaries, especially the ones covering the rare or OOV words. Our basic idea is to transform the low-frequency word pairs in bilingual dictionaries into adequate sequence pairs which guarantee the frequent occurrence of each word pair, so that NMT can learn translation mappings between the source word and the target word.

To achieve this goal, we propose two methods, as shown in Fig. 1. In the test sentence, the Chinese word appears only once in our training data and the baseline NMT cannot correctly translate this word. Fortunately, our bilingual dictionary contains this translation lexicon. Our first method extends the mixed word/character model proposed by Wu et al. (2016) to re-label the rare words in both the dictionary and the training data with character sequences, in which the characters are frequent and the character translation mappings can be learnt by NMT. Instead of backing off words into characters, our second method synthesizes adequate pseudo sentence pairs containing the translation lexicon, allowing NMT to learn the word translation mappings.

We make the following contributions in this paper:

  • We propose a low-frequency to high-frequency framework to bridge NMT and the bilingual dictionaries.

  • We propose and investigate two methods to utilize the bilingual dictionaries. One extends the mixed word/character model and the other designs a pseudo sentence pair synthesis model.

  • The extensive experiments on Chinese-to-English translation show that our proposed methods significantly outperform the strong attention-based NMT. We further find that most rare words can be correctly translated, as long as they are covered by the bilingual dictionary.

2 Neural Machine Translation

Our framework bridging NMT and the discrete bilingual dictionaries can be applied to any neural machine translation model. Without loss of generality, we use the attention-based NMT proposed by [Luong et al.2015b], which utilizes stacked Long Short-Term Memory (LSTM, [Hochreiter and Schmidhuber1997]) layers for both encoder and decoder as illustrated in Fig. 2.

The encoder-decoder NMT first encodes the source sentence into a sequence of context vectors whose size varies with respect to the source sentence length. Then, the encoder-decoder NMT decodes from the context vectors and generates target translation one word each time by maximizing the probability of . Note that () is word embedding corresponding to the () word in the source (target) sentence. Next, we briefly review the encoder introducing how to obtain and the decoder addressing how to calculate .

Figure 2: The architecture of the attention-based NMT which has stacked LSTM layers for encoder and stacked LSTM layers for decoder.

Encoder: The context vectors are generated by the encoder using stacked LSTM layers. is calculated as follows:


Where if .

Decoder: The conditional probability is computed in different ways according to the choice of the context at time . In [Cho et al.2014], the authors choose , while bahdanau:2014 use different context at different time step and the conditional probability will become:


where is the attention output:


The attention model calculates

as the weighted sum of the source-side context vectors, just as illustrated in the middle part of Fig. 2.


where is a normalized item calculated as follows:


is computed using the following formula:


If , will be calculated by combining as feed input [Luong et al.2015b]:


Given the sentence aligned bilingual training data , all the parameters of the encoder-decoder NMT are optimized to maximize the following conditional log-likelihood:
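For reference, the model reviewed in this section can be summarized in conventional notation as below. This is a sketch of the usual global-attention formulation with dot-product scoring and input feeding in the style of [Luong et al.2015b], not necessarily the authors' exact formulas. All symbols are assumed notation: $x_j$ and $y_i$ are source and target word embeddings, $h_j^k$ and $z_i^k$ are the encoder and decoder states at layer $k$ of $m$ stacked layers, $c_i$ is the context at time $i$, $\alpha_{ij}$ the normalized attention weight, $\hat{z}_i$ the attention output, and $W_c$, $W_s$ learned matrices.

```latex
% Assumed notation; a conventional summary, not the paper's verbatim equations.
\begin{align}
h_j^k &= \mathrm{LSTM}\bigl(h_{j-1}^k,\, h_j^{k-1}\bigr), \qquad h_j^{0} = x_j
  && \text{(encoder, layer } k\text{)} \\
z_i^k &= \mathrm{LSTM}\bigl(z_{i-1}^k,\, z_i^{k-1}\bigr), \qquad k > 1
  && \text{(decoder, upper layers)} \\
z_i^1 &= \mathrm{LSTM}\bigl(z_{i-1}^1,\, [\,y_{i-1};\, \hat{z}_{i-1}\,]\bigr)
  && \text{(decoder, first layer with feed input)} \\
\alpha_{ij} &= \frac{\exp\bigl(h_j^m \cdot z_i^m\bigr)}
                    {\sum_{j'} \exp\bigl(h_{j'}^m \cdot z_i^m\bigr)}
  && \text{(normalized attention weight)} \\
c_i &= \sum_{j} \alpha_{ij}\, h_j^m
  && \text{(weighted sum of source context vectors)} \\
\hat{z}_i &= \tanh\bigl(W_c\,[\,z_i^m;\, c_i\,]\bigr)
  && \text{(attention output)} \\
p(y_i \mid y_{<i}, X) &= \mathrm{softmax}\bigl(W_s\, \hat{z}_i\bigr)_{y_i} \\
\mathcal{L}(\theta) &= \sum_{n=1}^{N} \sum_{i=1}^{T_y^{(n)}}
  \log p\bigl(y_i^{(n)} \mid y_{<i}^{(n)}, X^{(n)};\, \theta\bigr)
  && \text{(conditional log-likelihood)}
\end{align}
```

The choice of a fixed context (the last encoder state) instead of $c_i$ recovers the non-attentional variant of [Cho et al.2014].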


3 Incorporating Bilingual Dictionaries

The word translation pairs in bilingual dictionaries are difficult to use in neural machine translation, mainly because they are rarely or never seen in the parallel training corpus. We attempt to build a bridge between NMT and bilingual dictionaries. We believe the bridge is a data transformation that turns rare or unseen word translation pairs into frequent ones and provides NMT with adequate information to learn latent translation mappings. In this work, we propose two methods to perform data transformation, at the character level and at the word level respectively.

3.1 Mixed Word/Character Model

Given a bilingual dictionary , we focus on the translation lexicons if is a rare or unknown word in the bilingual corpus .

We first introduce data transformation using the character-based method. We all know that words are composed of characters and most of the characters are frequent even though the word is never seen. This idea is popularly used to deploy open vocabulary NMT [Ling et al.2015, Costa-Jussà and Fonollosa2016, Chung et al.2016].

Character translation mappings are much easier to learn for NMT than word translation mappings. However, given a character sequence of a source language word, NMT cannot guarantee that the generated character sequence leads to a valid target language word. Therefore, we prefer the framework mixing words and characters, which is employed by Wu et al. (2016) to handle OOV words. If a word is frequent, we keep it unchanged. Otherwise, we fall back to its character sequence.

We perform data transformation on both the parallel training corpus and the bilingual dictionaries. Here, English sentences and words are adopted as examples. Suppose we keep an English vocabulary containing the words whose frequency exceeds a threshold. For each English word (e.g. oak) in a parallel sentence pair or in a translation lexicon, if the word is in the vocabulary, it is left as it is. Otherwise, it is re-labelled by its character sequence. For example, oak will be:

<B>o <M>a <E>k

where <B>, <M> and <E> denote respectively the beginning, middle and end of a word.
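The re-labelling rule can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `relabel_word` and `relabel_sentence`, the toy vocabulary, and the decision to tag a single-character word with the begin marker alone are assumptions, and `<B>`, `<M>`, `<E>` are one possible spelling of the begin, middle and end markers (the spelling used by Wu et al. 2016).

```python
def relabel_word(word, vocab):
    """Return [word] if it is in the frequent vocabulary, else tagged characters."""
    if word in vocab or not word:
        return [word]
    if len(word) == 1:
        # Assumption: a one-character rare word only gets the begin tag.
        return ["<B>" + word]
    chars = ["<B>" + word[0]]                      # first character
    chars += ["<M>" + c for c in word[1:-1]]       # middle characters
    chars.append("<E>" + word[-1])                 # last character
    return chars


def relabel_sentence(tokens, vocab):
    """Apply the mixed word/character re-labelling to a tokenized sentence."""
    out = []
    for tok in tokens:
        out.extend(relabel_word(tok, vocab))
    return out


# Toy vocabulary of "frequent" words (hypothetical).
vocab = {"the", "tree", "is", "an"}
print(relabel_sentence("the tree is an oak".split(), vocab))
# ['the', 'tree', 'is', 'an', '<B>o', '<M>a', '<E>k']
```

The same function would be applied to both sides of each dictionary entry and to every training sentence, so that a rare word is represented identically wherever it occurs.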

3.2 Pseudo Sentence Pair Synthesis Model

Since NMT is a data-driven approach, it can learn latent translation mappings for a word pair if there exist many parallel sentences containing it. Along this line, we propose the pseudo sentence pair synthesis model. In this model, we aim at synthesizing, for a rare or unknown translation lexicon, adequate pseudo parallel sentences, each of which contains that lexicon.

Although there are not enough bilingual sentence pairs in many languages (and many domains), a huge amount of monolingual data is available on the web. In this paper, we make use of the source-side monolingual data to synthesize the pseudo bilingual sentence pairs.

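Input: bilingual training data; bilingual dictionary; source language monolingual data; number of pseudo sentence pairs for each dictionary entry
Output: pseudo sentence pairs
1: Build an SMT system PBMT on the bilingual training data merged with the bilingual dictionary;
2: Initialize the set of pseudo sentence pairs as empty;
3: for each translation lexicon in the bilingual dictionary do
4:     Retrieve monolingual sentences containing its source word from the source language monolingual data;
5:     Translate them into target language sentences using PBMT;
6:     Add the resulting sentence pairs into the set of pseudo sentence pairs;
7: end for
Algorithm 1 Pseudo Sentence Pair Synthesis.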

For constructing the pseudo sentence pairs, we resort to statistical machine translation (SMT) and apply a self-learning method as illustrated in Algorithm 1. In contrast to NMT, SMT (e.g. phrase-based SMT [Koehn et al.2007, Xiong et al.2006]) can easily integrate bilingual dictionaries [Wu et al.2008] as long as we consider the translation lexicons of bilingual dictionaries as phrasal translation rules. Following [Wu et al.2008], we first merge the bilingual sentence corpus with the bilingual dictionaries, and employ phrase-based SMT to train an SMT system called PBMT (line 1 in Algorithm 1).

For each rare or unknown word translation pair, we can easily retrieve adequate source language monolingual sentences containing the source word from the web or other data collections. PBMT is then applied to translate these sentences into target language translations. As PBMT employs the bilingual dictionaries as additional translation rules, each target translation sentence will contain the target word of the pair, so each resulting sentence pair includes the word translation pair. Finally, we pair the source sentences with their translations to yield pseudo sentence pairs, which are added into the pseudo bilingual data (line 2-6 in Algorithm 1).
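The synthesis loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: `synthesize_pseudo_pairs` is a hypothetical name, `translate` stands in for the dictionary-augmented PBMT system (any callable mapping a source token list to a target token list), retrieval is a simple in-memory word index rather than a web search, and the pinyin-like tokens are toy placeholders.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Sentence = List[str]


def synthesize_pseudo_pairs(
    dictionary: Iterable[Tuple[str, str]],
    monolingual: Iterable[Sentence],
    translate: Callable[[Sentence], Sentence],
    k: int,
) -> List[Tuple[Sentence, Sentence]]:
    """For each dictionary entry, pair up to k source sentences with SMT output."""
    # Index the monolingual sentences by the words they contain.
    index: Dict[str, List[Sentence]] = {}
    for sent in monolingual:
        for word in set(sent):
            index.setdefault(word, []).append(sent)

    pseudo_pairs: List[Tuple[Sentence, Sentence]] = []
    for src_word, _tgt_word in dictionary:
        # Retrieve up to k sentences containing the source word, then translate.
        for sent in index.get(src_word, [])[:k]:
            pseudo_pairs.append((sent, translate(sent)))
    return pseudo_pairs


# Toy stand-in for PBMT: word-by-word lookup that includes the dictionary entry.
lexicon = {"mao": "cat", "chi": "eats", "yu": "fish", "xiangshu": "oak"}


def toy_translate(sent: Sentence) -> Sentence:
    return [lexicon.get(w, w) for w in sent]


mono = [["mao", "chi", "yu"], ["xiangshu", "hen", "gao"], ["xiangshu", "mu"]]
pairs = synthesize_pseudo_pairs([("xiangshu", "oak")], mono, toy_translate, k=1)
print(pairs)
# [(['xiangshu', 'hen', 'gao'], ['oak', 'hen', 'gao'])]
```

Because the translator has the dictionary entry as a rule, every retrieved sentence yields a target side containing the dictionary translation, which is what makes the rare pair frequent in the synthesized data.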

The original bilingual corpus and the pseudo bilingual sentence pairs are combined to train a new NMT model. Some may worry that the target parts of the pseudo sentence pairs are SMT results rather than well-formed sentences, which would harm NMT training. Fortunately, Sennrich et al. (2015a), Cheng et al. (2016b) and Zhang and Zong (2016) observe from large-scale experiments that bilingual data synthesized with the self-learning framework can substantially improve NMT performance. Since the pseudo sentence pairs now incorporate the bilingual dictionaries, we expect that the NMT trained on the combined data can not only significantly boost the translation quality, but also solve the problem of rare word translation for words covered by the dictionary.

Note that the pseudo sentence pair synthesis model can be further augmented by the mixed word/character model to solve other OOV translations.

4 Experimental Settings

Method                          Vocab (zh)  Vocab (en)  MT03   MT04   MT05   MT06   MT08   Ave
Moses                           -           -           30.30  31.04  28.19  30.04  23.20  28.55
Zoph_RNN                        38815       30514       34.77  37.40  32.94  33.85  25.93  32.98
Zoph_RNN-mixed                  42769       30630       35.57  38.07  34.44  36.07  26.81  34.19
Zoph_RNN-mixed-dic              42892       30630       36.29  38.75  34.86  36.57  27.04  34.70
Zoph_RNN-pseudo ()              42133       32300       35.66  38.02  34.66  36.51  27.65  34.50
Zoph_RNN-pseudo-dic ()          42133       31734       36.48  38.59  35.81  38.14  28.65  35.53
Zoph_RNN-pseudo ()              43080       32813       35.00  36.99  34.22  36.09  26.80  33.82
Zoph_RNN-pseudo-dic ()          43080       32255       36.92  38.63  36.09  38.13  29.53  35.86
Zoph_RNN-pseudo ()              44162       33357       36.07  37.74  34.63  36.66  27.58  34.54
Zoph_RNN-pseudo-dic ()          44162       32797       37.26  39.01  36.64  38.50  30.17  36.32
Zoph_RNN-pseudo ()              45195       33961       35.44  37.96  34.89  36.92  27.80  34.60
Zoph_RNN-pseudo-dic ()          45195       33399       36.93  39.15  36.85  38.77  30.25  36.39
Zoph_RNN-pseudo-mixed ()        45436       32659       38.17  39.55  36.86  38.53  28.46  36.31
Zoph_RNN-pseudo-mixed-dic ()    45436       32421       38.66  40.78  38.36  39.56  30.64  37.60
Table 1: Translation results (BLEU score) for different translation methods. The parenthesized setting after a pseudo method name gives the number of pseudo sentence pairs synthesized for each word translation pair (e.g. 10). The Vocab (zh) and Vocab (en) columns report the Chinese and English vocabulary sizes limited by the respective frequency thresholds. Note that all the NMT systems use a single model rather than an ensemble.

In this section we describe the data sets, data preprocessing, the training and evaluation details, and all the translation methods we compare in the experiments.

4.1 Dataset

We perform the experiments on Chinese-to-English translation. Our bilingual training data includes 630K sentence pairs (each sentence is limited to at most 50 words) extracted from LDC corpora (LDC2000T50, LDC2002T01, LDC2002E18, LDC2003E07, LDC2003E14, LDC2003T17, LDC2004T07). Without using very large-scale data, it is relatively easy to evaluate the effectiveness of the bilingual dictionaries. For validation, we choose the NIST 2003 (MT03) dataset. For testing, we use the NIST 2004 (MT04), NIST 2005 (MT05), NIST 2006 (MT06) and NIST 2008 (MT08) datasets. The test sentences are kept at their original length. As for the source-side monolingual data, we collect about 100M Chinese sentences, of which approximately 40% are provided by Sogou and the rest are collected by searching the web for the words in the bilingual data. We use two bilingual dictionaries: one is from LDC (LDC2002L27) and the other was manually collected by ourselves. The combined dictionary contains 86,252 translation lexicons in total.

4.2 Data Preprocessing

If necessary, the Chinese sentences are word segmented using the Stanford Word Segmenter. The English sentences are tokenized using the tokenizer script from the Moses decoder. We limit the vocabulary in both Chinese and English using a frequency threshold, chosen separately for Chinese and for English; the resulting vocabulary sizes are reported in Table 1. As we focus on rare or unseen translation lexicons of the bilingual dictionary in this work, we filter the dictionary and retain only those lexicons, resulting in 8306 entries, of which 2831 appear in the validation and test data sets. All the OOV words are replaced with UNK in the word-based NMT and are re-labelled into character sequences in the mixed word/character model.
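The vocabulary limiting and UNK replacement used for the word-based systems can be sketched as below. The helper names `build_vocab` and `replace_oov`, the toy corpus and the threshold value are hypothetical; the only behaviour taken from the text is that a word is kept when its frequency exceeds the threshold and that every other word becomes UNK.

```python
from collections import Counter


def build_vocab(sentences, threshold):
    """Keep the words whose corpus frequency exceeds the threshold."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {word for word, count in counts.items() if count > threshold}


def replace_oov(sentence, vocab, unk="UNK"):
    """Replace every out-of-vocabulary token with the UNK symbol."""
    return [tok if tok in vocab else unk for tok in sentence]


# Toy corpus: "a" occurs 3 times, "b" twice, "c" once.
corpus = [["a", "b", "a"], ["a", "c", "b"]]
vocab = build_vocab(corpus, threshold=1)
print(sorted(vocab))                          # ['a', 'b']
print(replace_oov(["a", "c", "d"], vocab))    # ['a', 'UNK', 'UNK']
```

In the mixed word/character setting the same vocabulary test would decide which words are split into tagged characters instead of being mapped to UNK.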

4.3 Training and Evaluation Details

We build the described models using the Zoph_RNN toolkit, which is written in C++/CUDA and provides efficient training across multiple GPUs. In the NMT architecture as illustrated in Fig. 2, the encoder includes two stacked LSTM layers, followed by a global attention layer, and the decoder also contains two stacked LSTM layers followed by the softmax layer. The word embedding dimension and the size of hidden layers are all set to 1000.

Each NMT model is trained on GPU K80 using stochastic gradient decent algorithm AdaGrad [Duchi et al.2011]. We use a mini batch size of and we run a total of 20 iterations for all the data sets. The training time for each model ranges from 2 days to 4 days. At test time, we employ beam search with beam size . We use case-insensitive 4-gram BLEU score as the automatic metric [Papineni et al.2002] for translation quality evaluation.

4.4 Translation Methods

In the experiments, we compare our method with the conventional SMT model and the baseline attention-based NMT model. We list all the translation methods as follows:

  • Moses: It is the state-of-the-art phrase-based SMT system [Koehn et al.2007]. We use its default configuration and train a 4-gram language model on the target portion of the bilingual training data.

  • Zoph_RNN: It is the baseline attention-based NMT system [Luong et al.2015a, Zoph et al.2016] using two stacked LSTM layers for both of the encoder and the decoder.

  • Zoph_RNN-mixed-dic: It is our NMT system which integrates the bilingual dictionaries by re-labelling the rare or unknown words with character sequence on both bilingual training data and bilingual dictionaries. Zoph_RNN-mixed indicates that mixed word/character model is performed only on the bilingual training data and the bilingual dictionary is not used.

  • Zoph_RNN-pseudo-dic: It is our NMT system that integrates the bilingual dictionaries by synthesizing adequate pseudo sentence pairs that contain the focused rare or unseen translation lexicons. Zoph_RNN-pseudo means that the target language parts of pseudo sentence pairs are obtained by the SMT system PBMT without using the bilingual dictionary .

  • Zoph_RNN-pseudo-mixed-dic: It is an NMT system combining the two methods Zoph_RNN-pseudo and Zoph_RNN-mixed. Zoph_RNN-pseudo-mixed is similar to Zoph_RNN-pseudo.

5 Translation Results and Analysis

For translation quality evaluation, we attempt to figure out the following three questions: 1) Could the employed attention-based NMT outperform SMT even on less than 1 million sentence pairs? 2) Which model is more effective for integrating the bilingual dictionaries: mixed word/character model or pseudo sentence pair synthesis data? 3) Can the combined two proposed methods further boost the translation performance?

5.1 NMT vs. SMT

Table 1 reports the detailed translation quality for different methods. Comparing the first two lines in Table 1, the attention-based NMT system Zoph_RNN substantially outperforms the phrase-based SMT system Moses on just 630K bilingual Chinese-English sentence pairs. The gap is as large as 6.36 absolute BLEU points on MT04 (37.40 vs. 31.04), and the average improvement is 4.43 BLEU points (32.98 vs. 28.55). This is in line with the findings reported in [Wu et al.2016, Junczys-Dowmunt et al.2016], which conducted experiments on tens of millions or even more parallel sentence pairs. Our experiments further show that NMT can still be much better even when we have fewer than 1 million sentence pairs.

5.2 The Effect of The Mixed W/C Model

Lines 3-4 in Table 1 present the BLEU scores when applying the mixed word/character model in NMT. We can see that this model markedly improves the translation quality over the baseline attention-based NMT, although the idea behind it is very simple.

Specifically, the system Zoph_RNN-mixed, trained only on the bitext, achieves an average improvement of more than 1.0 BLEU point (34.19 vs. 32.98) over the baseline Zoph_RNN. It indicates that the mixed word/character model can alleviate the OOV translation problem to some extent. For example, the number 31.3 is an OOV word in Chinese. The mixed model transforms this word into a character sequence, which is correctly copied to the target side, yielding the correct translation 31.3. Moreover, some named entities (e.g. the person name hecker) can be well translated.

When adding the bilingual dictionary as training data, the system Zoph_RNN-mixed-dic further gets a moderate improvement of 0.51 BLEU points (34.70 vs. 34.19) on average. We find that the mixed model can make use of some rare or unseen translation lexicons in NMT, as illustrated in the first two parts of Table 2. In the first part of Table 2, the English side of the translation lexicon is a frequent word (e.g. remain). A frequent Chinese character carries most of the meaning of the whole word and thus the word can be correctly translated into remain. We are a little surprised by the examples in the second part of Table 2, since the correct English parts are all OOV words which require each English character to be correctly generated. It demonstrates that the mixed model has some ability to predict the correct character sequence. However, this mixed model fails in many scenarios. The third part of Table 2 gives some bad cases. If the first predicted character is wrong, the final word translation will be incorrect (e.g. take-owned lane vs. overtaking lane). This is the main reason why the mixed model could not obtain large improvements.

Chinese Word Translation Correct
remain remain
owner owner
blaze blaze
placebo placebo
tsunami tsunami
intravenous intravenous
anti-subsidization reactor
lingchiang river huangpu river
take-owned lane overtaking lane
Table 2: The effect of the Zoph_RNN-mixed-dic model in using bilingual dictionaries. The Chinese word is written in Pinyin. The first two parts are positive word translation examples, while the third part shows some bad cases.

5.3 The Effect of Data Synthesis Model

The eight lines (5-12) in Table 1 show the translation performance of the pseudo sentence pair synthesis model. We can analyze the results from three perspectives: 1) the effect of the self-learning method for using the source-side monolingual data; 2) the effect of the bilingual dictionary; and 3) the effect of pseudo sentence pair number.

The results in the odd lines (lines with Zoph_RNN-pseudo) demonstrate that the synthesized parallel sentence pairs using source-side monolingual data can significantly improve the baseline NMT Zoph_RNN, and the average improvement can be up to 1.62 BLEU points (34.60 vs. 32.98). This finding is also reported by Cheng et al. (2016b) and Zhang and Zong (2016).

After augmenting Zoph_RNN-pseudo with bilingual dictionaries, we can further obtain considerable gains. The largest average improvement can be 3.41 BLEU points when compared to the baseline NMT Zoph_RNN and 2.04 BLEU points when compared to Zoph_RNN-pseudo (35.86 vs. 33.82).

When investigating the effect of the number of pseudo sentence pairs, we find that the performance generally improves as we synthesize more pseudo sentence pairs for each rare or unseen word translation pair. We also notice that the improvement gets smaller and smaller as this number grows.

0.36 0.71 0.76 0.78 0.79
Table 3: The hit rate of the bilingual dictionary for different models.

5.4 Mixed W/C Model vs. Data Synthesis Model

Comparing the results between the mixed model and the data synthesis model (Zoph_RNN-mixed-dic vs. Zoph_RNN-pseudo-dic) in Table 1, we can easily see that the data synthesis model is much better at integrating bilingual dictionaries into NMT. Zoph_RNN-pseudo-dic substantially outperforms Zoph_RNN-mixed-dic, with an average improvement of up to 1.69 BLEU points (36.39 vs. 34.70).

Through a deeper analysis, we find that most of the rare or unseen words in the test sets can be well translated by Zoph_RNN-pseudo-dic if they are covered by the bilingual dictionary. Table 3 reports the hit rate of the bilingual dictionaries. Each value gives the fraction of the covered rare or unseen words in the test set that are correctly translated. This table explains why Zoph_RNN-pseudo-dic performs much better than Zoph_RNN-mixed-dic.
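A hit rate of this kind can be computed along the following lines. This is only a sketch: the text does not state how a translation was judged correct, so the criterion used here (the dictionary translation appears as a token in the system output) is an assumption, as are the function name `hit_rate`, the single-token dictionary entries and the toy data.

```python
def hit_rate(test_cases, dictionary):
    """Fraction of dictionary-covered source words whose translation is produced.

    test_cases: list of (source_tokens, hypothesis_tokens) pairs.
    dictionary: maps a rare source word to its (single-token) translation.
    """
    covered = 0
    hits = 0
    for source, hypothesis in test_cases:
        produced = set(hypothesis)
        for word in source:
            if word in dictionary:
                covered += 1
                # Assumed criterion: the dictionary translation occurs in the output.
                if dictionary[word] in produced:
                    hits += 1
    return hits / covered if covered else 0.0


# Toy data with placeholder tokens.
dictionary = {"xiangshu": "oak", "haixiao": "tsunami"}
cases = [
    (["xiangshu", "hen", "gao"], ["the", "oak", "is", "tall"]),   # hit
    (["haixiao", "lai", "le"], ["the", "wave", "came"]),          # miss
]
print(hit_rate(cases, dictionary))  # 0.5
```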

The last two lines in Table 1 demonstrate that the combined method can further boost the translation quality. The biggest average improvement over the baseline NMT Zoph_RNN can be as large as 4.62 BLEU points, which is very promising. We believe that this method fully exploits the capacity of the data synthesis model and the mixed model. Zoph_RNN-pseudo-dic can well incorporate the bilingual dictionary and Zoph_RNN-mixed can well handle the OOV word translation. Thus, the combined method is the best.

One may argue that the proposed methods use a bigger vocabulary and that the performance gains may be attributed to the increased vocabulary size. We therefore conduct a further experiment for the baseline NMT Zoph_RNN with an enlarged vocabulary. We find that this setting decreases the translation quality by 0.88 BLEU points on average (32.10 vs. 32.98). This further verifies the superiority of our proposed methods.

6 Related Work

The recently proposed neural machine translation has drawn more and more attention. Most of the existing methods mainly focus on designing better attention models [Luong et al.2015b, Cheng et al.2016a, Cohn et al.2016, Feng et al.2016, Liu et al.2016, Meng et al.2016, Mi et al.2016a, Mi et al.2016b, Tu et al.2016], better objective functions for BLEU evaluation [Shen et al.2016], better strategies for handling open vocabulary [Ling et al.2015, Luong et al.2015c, Jean et al.2015, Sennrich et al.2015b, Costa-Jussà and Fonollosa2016, Lee et al.2016, Li et al.2016, Mi et al.2016c, Wu et al.2016] and exploiting large-scale monolingual data [Gulcehre et al.2015, Sennrich et al.2015a, Cheng et al.2016b, Zhang and Zong2016].

Our focus in this work is to fully integrate the discrete bilingual dictionaries into NMT. The most related works lie in three aspects: 1) applying the character-based method to deal with open vocabulary; 2) making use of the synthesized data in NMT, and 3) incorporating translation lexicons in NMT.

Ling et al. (2015), Costa-Jussà and Fonollosa (2016) and Sennrich et al. (2015b) propose purely character-based or subword-based neural machine translation to circumvent the open word vocabulary problem. Luong et al. (2015c) and Wu et al. (2016) present the mixed word/character model, which utilizes character sequences to replace the OOV words. We introduce the mixed model to integrate the bilingual dictionaries and find that it is useful but not the best method.

Sennrich et al. (2015a) propose an approach that uses target-side monolingual data to synthesize bitexts. They generate the synthetic bilingual data by translating the target monolingual sentences into source language sentences and retrain NMT with the mixture of the original bilingual data and the synthetic parallel data. Cheng et al. (2016b) and Zhang and Zong (2016) also investigate the effect of synthesized parallel sentences. They report that the pseudo sentence pairs synthesized using source-side monolingual data can significantly improve the translation quality. These studies inspire us to leverage synthesized data to incorporate the bilingual dictionaries in NMT.

Very recently, Arthur et al. (2016) try to use discrete translation lexicons in NMT. Their approach attempts to employ the discrete translation lexicons to positively influence the probability distribution of the output words in the NMT softmax layer. However, their approach only focuses on the words that belong to the vocabulary, and the out-of-vocabulary (OOV) words are not considered. In contrast, we concentrate on the word translation lexicons which are rarely or never seen in the bilingual training data, which is a much tougher problem. The extensive experiments demonstrate that our proposed models, especially the data synthesis model, can solve this problem very well.

7 Conclusions and Future Work

In this paper, we have presented two models to bridge neural machine translation and bilingual dictionaries in which translation lexicons are rarely or never seen in the bilingual training data. Our proposed methods focus on a data transformation mechanism which guarantees the massive and repetitive occurrence of the translation lexicon.

The mixed word/character model tackles this problem by re-labelling the OOV words with character sequence, while our data synthesis model constructs adequate pseudo sentence pairs for each translation lexicon. The extensive experiments show that the data synthesis model substantially outperforms the mixed word/character model, and the combined method performs best. All of the proposed methods obtain promising improvements over the baseline NMT. We further find that more than 70% of the rare or unseen words in test sets can get correct translations as long as they are covered by the bilingual dictionary.

Currently, the data synthesis model does not distinguish the original bilingual training data from the synthesized parallel sentences, in which the target sides are SMT translation results. In future work, we plan to modify the neural network structure to avoid the negative effect of the SMT translation noise.