Triangular Architecture for Rare Language Translation

05/13/2018 ∙ by Shuo Ren, et al.

Neural Machine Translation (NMT) performs poorly on the low-resource language pair (X,Z), especially when Z is a rare language. By introducing another rich language Y, we propose a novel triangular training architecture (TA-NMT) to leverage bilingual data (Y,Z) (may be small) and (X,Y) (can be rich) to improve the translation performance of low-resource pairs. In this triangular architecture, Z is taken as the intermediate latent variable, and translation models of Z are jointly optimized with a unified bidirectional EM algorithm under the goal of maximizing the translation likelihood of (X,Y). Empirical results demonstrate that our method significantly improves the translation quality of rare languages on the MultiUN and IWSLT2012 datasets, and achieves even better performance when combined with back-translation methods.


1 Introduction

In recent years, Neural Machine Translation (NMT) Kalchbrenner and Blunsom (2013); Sutskever et al. (2014); Bahdanau et al. (2014) has achieved remarkable performance on many translation tasks (Jean et al., 2015; Sennrich et al., 2016; Wu et al., 2016; Sennrich et al., 2017). Being an end-to-end architecture, an NMT system first encodes the input sentence into a sequence of real vectors, based on which the decoder generates the target sequence word by word with the attention mechanism Bahdanau et al. (2014); Luong et al. (2015). During training, NMT systems are optimized to maximize the translation probability of a given language pair with the Maximum Likelihood Estimation (MLE) method, which requires large bilingual data to fit the large parameter space. Without adequate data, which is common especially when it comes to a rare language, NMT usually falls short on low-resource language pairs Zoph et al. (2016).

In order to deal with the data sparsity problem for NMT, exploiting monolingual data Sennrich et al. (2015); Zhang and Zong (2016); Cheng et al. (2016); Zhang et al. (2018); He et al. (2016) is the most common method. With monolingual data, the back-translation method Sennrich et al. (2015) generates pseudo bilingual sentences with a target-to-source translation model to train the source-to-target one. By extending back-translation, source-to-target and target-to-source translation models can be jointly trained and boost each other Cheng et al. (2016); Zhang et al. (2018). Similar to joint training Cheng et al. (2016); Zhang et al. (2018), dual learning He et al. (2016) designs a reinforcement learning framework to better capitalize on monolingual data and jointly train two models.

Instead of leveraging monolingual data (X or Z) to enrich the low-resource bilingual pair (X,Z), in this paper we are motivated to introduce another rich language Y, by which additionally acquired bilingual data (Y,Z) and (X,Y) can be exploited to improve the translation performance of (X,Z). This requirement is easy to satisfy, especially when Z is a rare language but X is not. Under this scenario, (X,Y) can be a rich-resource pair and provide much bilingual data, while (Y,Z) would also be a low-resource pair, mostly because Z is rare. For example, in the IWSLT2012 dataset, there are only 112.6K bilingual sentence pairs of English-Hebrew, since Hebrew is a rare language. If French is introduced as the third language, we can have another low-resource bilingual dataset of French-Hebrew (116.3K sentence pairs), and easily-acquired bilingual data of the rich-resource pair English-French.

Figure 1: Triangular architecture for rare language translation. Solid lines mean rich-resource and dashed lines mean low-resource. X, Y and Z are three different languages.

With the introduced rich language Y, in this paper we propose a novel triangular architecture (TA-NMT) to exploit the additional bilingual data of (Y,Z) and (X,Y), in order to get better translation performance on the low-resource pair (X,Z), as shown in Figure 1. In this architecture, (Y,Z) is used to train another translation model that scores the translation model of (X,Z), while (X,Y) is used to provide large bilingual data with favorable alignment information.

Under the motivation of exploiting the rich-resource pair (X,Y), instead of modeling X⇒Z directly, our method starts from modeling the translation task X⇒Y while taking Z as a latent variable. We then decompose p(y|x) into two phases for training two translation models of the low-resource pairs ((X,Z) and (Y,Z)) respectively. The first translation model generates a sequence in the hidden space of Z from X, based on which the second one generates the translation in Y. These two models can be optimized jointly within an Expectation Maximization (EM) framework with the goal of maximizing the translation probability p(y|x). In this framework, the two models boost each other by generating pseudo bilingual data for model training, weighted with scores from the other model. By reversing the translation direction of (X,Y), our method can be used to train another two translation models, p(z|y) and p(x|z). Therefore, the four translation models (p(z|x), p(y|z), p(z|y) and p(x|z)) of the rare language Z can be optimized jointly with our proposed unified bidirectional EM algorithm.

Experimental results on the MultiUN and IWSLT2012 datasets demonstrate that our method achieves significant improvements for rare language translation. By incorporating back-translation (a method leveraging more monolingual data) into our method, TA-NMT achieves even further improvements.

Our contributions are listed as follows:

  • We propose a novel triangular training architecture (TA-NMT) to effectively tackle the data sparsity problem for rare languages in NMT with an EM framework.

  • Our method can exploit two additional bilingual datasets at both the model and data levels by introducing another rich language.

  • Our method is a unified bidirectional EM algorithm, in which four translation models on two low-resource pairs are trained jointly and boost each other.

2 Method

As shown in Figure 1, our method tries to leverage (X,Y) (a rich-resource pair) and (Y,Z) to improve the translation performance of the low-resource pair (X,Z), during which the translation models of (X,Z) and (Y,Z) can be improved jointly.

Instead of directly modeling the translation probabilities of the low-resource pairs, we model the rich-resource pair translation p(y|x), with the rare language Z acting as a bridge to connect X and Y. We decompose p(y|x) into two phases for training two translation models. The first model p(z|x; θ_xz) generates the latent translation z in Z from the input sentence x in X, based on which the second model p(y|z; θ_zy) generates the final translation y in language Y. Following the standard EM procedure Borman (2004) and Jensen’s inequality, we derive the lower bound of p(y|x) over the whole training data D as follows:

\mathcal{L}(\Theta;\mathcal{D}) = \sum_{(x,y)\in\mathcal{D}} \log p(y|x;\Theta)
= \sum_{(x,y)\in\mathcal{D}} \log \sum_{z} Q(z)\,\frac{p(z|x;\theta_{xz})\,p(y|x,z)}{Q(z)}
\geq \sum_{(x,y)\in\mathcal{D}} \sum_{z} Q(z)\,\log \frac{p(z|x;\theta_{xz})\,p(y|x,z)}{Q(z)}
\approx \sum_{(x,y)\in\mathcal{D}} \sum_{z} Q(z)\,\log \frac{p(z|x;\theta_{xz})\,p(y|z;\theta_{zy})}{Q(z)}          (1)

where Θ = {θ_xz, θ_zy} is the set of model parameters of p(z|x) and p(y|z), and Q(z) is an arbitrary posterior distribution of z. We denote the lower bound in the last but one line as L(Q, Θ). Note that we use the approximation p(y|x,z) ≈ p(y|z) due to the semantic equivalence of the parallel sentences x and y.

In the following subsections, we will first propose our EM method in subsection 2.1 based on the lower-bound derived above. Next, we will extend our method to two directions and give our unified bidirectional EM training in subsection 2.2. Then, in subsection 2.3, we will discuss more training details of our method and present our algorithm in the form of pseudo codes.

2.1 EM Training

To maximize L(Θ; D), the EM algorithm can be leveraged to maximize its lower bound L(Q, Θ). In the E-step, we calculate the expectation of the latent variable z using the current estimate of the model, namely we find the posterior distribution Q(z). In the M-step, with the expectation Q(z), we maximize the lower bound L(Q, Θ). Note that conditioned on the observed data and the current model, the exact posterior p(z|x,y) is intractable, so we choose Q(z) = p(z|x; θ_xz) approximately.

M-step: In the M-step, we maximize the lower bound L(Q, Θ) w.r.t. the model parameters θ_zy given Q(z). By substituting Q(z) = p(z|x; θ_xz) into L(Q, Θ), we get the M-step as follows:

\hat{\theta}_{zy} = \arg\max_{\theta_{zy}} \sum_{(x,y)\in\mathcal{D}} \mathbb{E}_{z\sim p(z|x;\theta_{xz})} \log p(y|z;\theta_{zy})          (2)

E-step: The approximate choice of Q(z) brings in a gap between L(Θ; D) and L(Q, Θ), which can be minimized in the E-step with the Generalized EM method McLachlan and Krishnan (2007). Following Bishop (2006), we can write this gap explicitly as follows:

\log p(y|x) - \mathcal{L}(Q,\Theta) = \mathrm{KL}\big(Q(z)\,\|\,p(z|x,y)\big) \approx \mathrm{KL}\big(p(z|x;\theta_{xz})\,\|\,p(z|y;\theta_{yz})\big)          (3)

where KL(·‖·) is the Kullback–Leibler divergence, and the approximation p(z|x,y) ≈ p(z|y; θ_yz), again justified by the semantic equivalence of the parallel sentences x and y, is also used above.

In the E-step, we minimize the gap between L(Q, Θ) and log p(y|x) as follows:

\hat{\theta}_{xz} = \arg\min_{\theta_{xz}} \sum_{(x,y)\in\mathcal{D}} \mathrm{KL}\big(p(z|x;\theta_{xz})\,\|\,p(z|y;\theta_{yz})\big)          (4)

To sum up, the E-step optimizes the model p(z|x; θ_xz) by minimizing the gap between L(Q, Θ) and log p(y|x) to get a better lower bound L(Q, Θ). This lower bound is then maximized in the M-step to optimize the model p(y|z; θ_zy). Given the new model θ_zy, the E-step tries to optimize θ_xz again to find a new lower bound, with which the M-step is re-performed. This iterative process continues until the models converge, which is guaranteed by the convergence of the EM algorithm.
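To make this alternation concrete, here is a minimal Python sketch in which toy categorical distributions stand in for the NMT models (all names are illustrative; this is not the authors' implementation). The E-step nudges the stand-in for p(z|x; θ_xz) toward the reference p(z|y; θ_yz) by descending a sampled KL gradient, and the M-step raises the expected log-likelihood of y under z drawn from the current Q(z).

import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Toy stand-ins over a 4-item latent vocabulary Z.
theta_xz = rng.normal(size=4)              # logits of Q(z) = p(z|x; theta_xz)
theta_zy = rng.normal(size=4)              # one score per z, toy stand-in for p(y|z; theta_zy)
p_z_given_y = softmax(rng.normal(size=4))  # fixed reference p(z|y; theta_yz)

lr = 0.1
for step in range(500):
    q = softmax(theta_xz)

    # E-step: descend a single-sample estimate of the KL(Q(z) || p(z|y)) gradient.
    z = rng.choice(4, p=q)
    grad_log_q = -q.copy()
    grad_log_q[z] += 1.0                                  # d log q(z) / d theta_xz
    theta_xz -= lr * grad_log_q * (np.log(q[z]) - np.log(p_z_given_y[z]))

    # M-step: raise E_{z~Q}[log p(y|z)], here with a toy sigmoid likelihood per z.
    p_y_given_z = 1.0 / (1.0 + np.exp(-theta_zy[z]))
    theta_zy[z] += lr * (1.0 - p_y_given_z)               # gradient of log sigmoid

print("learned Q(z):    ", np.round(softmax(theta_xz), 3))
print("reference p(z|y):", np.round(p_z_given_y, 3))

After enough iterations the learned Q(z) drifts toward the reference distribution, mirroring how the E-step pulls p(z|x) toward p(z|y) while the M-step fits the second model on samples from Q(z).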

2.2 Unified Bidirectional Training

The model p(z|y; θ_yz) is used as an approximation of p(z|x,y) in the E-step optimization (Equation 3). Due to the low-resource property of the language pair (Y,Z), p(z|y; θ_yz) cannot be well trained. To solve this problem, we jointly optimize p(z|y; θ_yz) and p(x|z; θ_zx) in a similar way by maximizing the reverse translation probability p(x|y).

We now give our unified bidirectional generalized EM procedures as follows:

  • Direction of X ⇒ Y

    E: Optimize θ_xz.

    \hat{\theta}_{xz} = \arg\min_{\theta_{xz}} \sum_{(x,y)\in\mathcal{D}} \mathrm{KL}\big(p(z|x;\theta_{xz})\,\|\,p(z|y;\theta_{yz})\big)          (5)

    M: Optimize θ_zy.

    \hat{\theta}_{zy} = \arg\max_{\theta_{zy}} \sum_{(x,y)\in\mathcal{D}} \mathbb{E}_{z\sim p(z|x;\theta_{xz})} \log p(y|z;\theta_{zy})          (6)

  • Direction of Y ⇒ X

    E: Optimize θ_yz.

    \hat{\theta}_{yz} = \arg\min_{\theta_{yz}} \sum_{(x,y)\in\mathcal{D}} \mathrm{KL}\big(p(z|y;\theta_{yz})\,\|\,p(z|x;\theta_{xz})\big)          (7)

    M: Optimize θ_zx.

    \hat{\theta}_{zx} = \arg\max_{\theta_{zx}} \sum_{(x,y)\in\mathcal{D}} \mathbb{E}_{z\sim p(z|y;\theta_{yz})} \log p(x|z;\theta_{zx})          (8)

Based on the above derivation, the whole architecture of our method is illustrated in Figure 2, where the dashed arrows denote one training direction, in which p(z|x) and p(y|z) are trained jointly with the help of p(z|y), while the solid arrows denote the reverse direction, in which p(z|y) and p(x|z) are trained jointly with the help of p(z|x).

Figure 2: Triangular Learning Architecture for Low-Resource NMT

2.3 Training Details

A major difficulty in our unified bidirectional training is the exponential search space of the translation candidates, which could be addressed by either sampling Shen et al. (2015); Cheng et al. (2016) or mode approximation Kim and Rush (2016). In our experiments, we leverage the sampling method and simply generate the top target sentence for approximation.

In order to perform gradient descent training, the parameter gradients for Equations 5 and 7 are formulated as follows:

\nabla_{\theta_{xz}} = \mathbb{E}_{z\sim p(z|x;\theta_{xz})}\Big[\nabla_{\theta_{xz}} \log p(z|x;\theta_{xz})\,\log\frac{p(z|x;\theta_{xz})}{p(z|y;\theta_{yz})}\Big]
\nabla_{\theta_{yz}} = \mathbb{E}_{z\sim p(z|y;\theta_{yz})}\Big[\nabla_{\theta_{yz}} \log p(z|y;\theta_{yz})\,\log\frac{p(z|y;\theta_{yz})}{p(z|x;\theta_{xz})}\Big]          (9)

Similar to reinforcement learning, the models p(z|x; θ_xz) and p(z|y; θ_yz) are trained using samples generated by the models themselves. According to our observation, some samples are noisy and detrimental to the training process. One way to tackle this is to filter out the bad ones using additional metrics (BLEU, etc.). Nevertheless, in our setting, BLEU scores cannot be calculated during training due to the absence of golden targets (z is generated based on x or y from the rich-resource pair (X,Y)). Therefore we choose IBM Model 1 scores to weight the generated translation candidates, with the word translation probabilities calculated on the given bilingual data (the low-resource pair (X,Z) or (Y,Z)). Additionally, to stabilize the training process, the pseudo samples generated by the model p(z|x) or p(z|y) are mixed with true bilingual samples in the same mini-batch at a ratio of 1:1. The whole training procedure is described in Algorithm 1 below, where the 5th and 9th steps generate the pseudo data.

Input: Rich-resource bilingual data (X,Y); low-resource bilingual data (X,Z) and (Y,Z)
Output: Parameters θ_xz, θ_zy, θ_yz and θ_zx
1: Pre-train p(z|x; θ_xz), p(y|z; θ_zy), p(z|y; θ_yz), p(x|z; θ_zx)
2: while not convergence do
3:     Sample (x, y) from (X,Y)
4:     Direction of X ⇒ Y: optimize θ_xz and θ_zy
5:     Generate z from x with p(z|x; θ_xz) and build the training batches (x, z) and (z, y)
6:     E-step: update θ_xz with Equation 5
7:     M-step: update θ_zy with Equation 6
8:     Direction of Y ⇒ X: optimize θ_yz and θ_zx
9:     Generate z from y with p(z|y; θ_yz) and build the training batches (y, z) and (z, x)
10:    E-step: update θ_yz with Equation 7
11:    M-step: update θ_zx with Equation 8
12: end while
13: return θ_xz, θ_zy, θ_yz and θ_zx
Algorithm 1: Training low-resource translation models with the triangular architecture
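As a companion to steps 5 and 9 of Algorithm 1 and the weighting scheme described above, the following Python sketch shows one way the generated pseudo pairs could be scored with IBM Model 1 and mixed 1:1 with true pairs in a mini-batch. The lexical table and helper names are hypothetical; in the actual method the word translation probabilities are estimated from the given low-resource bilingual data.

from math import prod

# Toy IBM Model 1 lexical table t(z_word | x_word), hand-filled for illustration.
t_table = {
    ("gato", "cat"): 0.8, ("gato", "the"): 0.05,
    ("el", "the"): 0.7,   ("el", "cat"): 0.1,
}

def ibm1_score(z_tokens, x_tokens, table, eps=1e-6):
    # IBM Model 1: product over target tokens of the average lexical
    # translation probability given all source tokens (length terms dropped).
    return prod(
        sum(table.get((z, x), eps) for x in x_tokens) / len(x_tokens)
        for z in z_tokens
    )

def build_mixed_batch(true_pairs, pseudo_pairs, table):
    # True pairs keep weight 1.0; pseudo pairs are weighted by their IBM
    # Model 1 score; both kinds are mixed in the same mini-batch (1:1 ratio).
    batch = [(x, z, 1.0) for x, z in true_pairs]
    batch += [(x, z, ibm1_score(z, x, table)) for x, z in pseudo_pairs]
    return batch

true_pairs = [(["the", "cat"], ["el", "gato"])]
pseudo_pairs = [(["the", "cat"], ["gato", "el"])]   # e.g. sampled from p(z|x)
for x, z, w in build_mixed_batch(true_pairs, pseudo_pairs, t_table):
    print(x, z, round(w, 4))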

3 Experiments

3.1 Datasets

In order to verify our method, we conduct experiments on two multilingual datasets. One is MultiUN Eisele and Chen (2010), which is a collection of translated documents from the United Nations, and the other is IWSLT2012 Cettolo et al. (2012), which is a set of multilingual transcriptions of TED talks. As mentioned in section 1, our method is compatible with methods exploiting monolingual data, so we also collect some extra monolingual data of the rare languages in both datasets and conduct experiments incorporating back-translation into our method.

MultiUN: English-French (EN-FR) bilingual data are used as the rich-resource pair (X,Y). Arabic (AR) and Spanish (ES) are used as two simulated rare languages Z. We randomly choose subsets of the bilingual data of (X,Z) and (Y,Z) from the original dataset to simulate low-resource situations, and make sure there is no overlap in Z between the chosen data of (X,Z) and (Y,Z).

IWSLT2012 (https://wit3.fbk.eu/mt.php?release=2012-02-plain): English-French is used as the rich-resource pair (X,Y), and the two rare languages Z of our choice are Hebrew (HE) and Romanian (RO). Note that in this dataset, the low-resource pairs (X,Z) and (Y,Z) are severely overlapped in Z. In addition, English-French bilingual data from the WMT2014 dataset are also used to enrich the rich-resource pair. We also use additional English-Romanian bilingual data from the Europarlv7 dataset Koehn (2005). The monolingual data of Z (HE and RO) are taken from the web (https://github.com/ajinkyakulkarni14/TED-Multilingual-Parallel-Corpus).

In both datasets, all sentences are filtered to lengths of 5 to 50 tokens after tokenization. The validation and test sets each consist of 2,000 parallel sentences sampled from the bilingual data, with the rest left as training data. The sizes of the training data of all language pairs are shown in Table 1.

MultiUN                        IWSLT2012
Lang       Size                Lang                                Size
EN-FR      9.9 M               EN-FR (together with WMT2014)       7.9 M
EN-AR      116 K               EN-HE                               112.6 K
FR-AR      116 K               FR-HE                               116.3 K
mono AR    3 M                 mono HE                             512.5 K
EN-ES      116 K               EN-RO (together with Europarlv7)    467.3 K
FR-ES      116 K               FR-RO                               111.6 K
mono ES    3 M                 mono RO                             885.0 K
Table 1: Training data size of each language pair.
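A small Python sketch of the preprocessing described above (hypothetical in-memory corpus; real corpora would be read from files): keep sentence pairs whose tokenized lengths fall within 5 to 50, then sample 2,000 pairs each for validation and test and leave the rest for training.

import random

def length_ok(sent, lo=5, hi=50):
    # Keep sentences whose tokenized length is within [lo, hi].
    return lo <= len(sent.split()) <= hi

def split_corpus(pairs, n_valid=2000, n_test=2000, seed=42):
    pairs = [(s, t) for s, t in pairs if length_ok(s) and length_ok(t)]
    random.Random(seed).shuffle(pairs)
    valid = pairs[:n_valid]
    test = pairs[n_valid:n_valid + n_test]
    train = pairs[n_valid + n_test:]
    return train, valid, test

corpus = [("this is a short example sentence .",
           "ceci est une phrase d exemple courte .")] * 5000
train, valid, test = split_corpus(corpus)
print(len(train), len(valid), len(test))   # 1000 2000 2000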

3.2 Baselines

We compare our method with four baseline systems. The first baseline is the RNNSearch model Bahdanau et al. (2014), which is a sequence-to-sequence model with an attention mechanism trained on the given small-scale bilingual data. The trained translation models are also used as the pre-trained models for our subsequent training processes.

The second baseline is PBSMT Koehn et al. (2003), which is a phrase-based statistical machine translation system. PBSMT is known to perform well on low-resource language pairs, so we want to compare it with our proposed method. We use the publicly available implementation of Moses (http://www.statmt.org/moses/) for training and testing in our experiments.

The third baseline is a teacher-student alike method Chen et al. (2017). For the sake of brevity, we will denote it as T-S. The process is illustrated in Figure 3. We treat this method as a baseline because it can also be regarded as a method exploiting (X,Y) and (Y,Z) to improve the translation of (X,Z), if we regard (X,Z) as the zero-resource pair and the model of (Y,Z) as the teacher model when training p(z|x) and p(x|z).

The fourth baseline is back-translation Sennrich et al. (2015). We will denote it as BackTrans. More concretely, to train the model of X ⇒ Z, we use the extra monolingual Z data described in Table 1 to do back-translation; to train the model of Z ⇒ X, we use monolingual X taken from (X,Y). The procedures for training the models of Y ⇒ Z and Z ⇒ Y are similar. This method uses extra monolingual data of Z compared with our TA-NMT method, but we can incorporate it into our method.

Figure 3: A teacher-student alike method for low-resource translation. For training p(z|x) and p(x|z), we mix the true pairs with the pseudo pairs generated by the teacher model in the same mini-batch. The training procedure of p(z|y) and p(y|z) is similar.
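For reference, a brief Python sketch of the pseudo-data side of the BackTrans baseline described above (the reverse model is a stub and the names are hypothetical): a trained Z-to-X model translates monolingual Z sentences, and each sentence is paired with its back-translation to form pseudo (X,Z) training pairs for the X-to-Z model.

from typing import Callable, List, Tuple

def back_translate(mono_z: List[str],
                   z_to_x: Callable[[str], str]) -> List[Tuple[str, str]]:
    # Pair each monolingual target sentence z with its back-translation x'.
    return [(z_to_x(z), z) for z in mono_z]

def dummy_z_to_x(z: str) -> str:
    # Stub standing in for a trained Z-to-X NMT system.
    return "<back-translation of: " + z + ">"

mono_z = ["oración monolingüe uno", "oración monolingüe dos"]
print(back_translate(mono_z, dummy_z_to_x))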

3.3 Overall Results

Experimental results on both datasets are shown in Tables 3 and 4, respectively, in which RNNSearch, PBSMT, T-S and BackTrans are the four baselines. TA-NMT is our proposed method, and TA-NMT(GI) is our method incorporating back-translation as good initialization. For the purpose of clarity and a fair comparison, we list the resources that different methods exploit in Table 2.

Method        Resources
PBSMT         (X,Z), (Y,Z)
RNNSearch     (X,Z), (Y,Z)
T-S           (X,Z), (Y,Z), (X,Y)
BackTrans     (X,Z), (Y,Z), (X,Y), mono Z
TA-NMT        (X,Z), (Y,Z), (X,Y)
TA-NMT(GI)    (X,Z), (Y,Z), (X,Y), mono Z
Table 2: Resources that different methods use.
Method EN2AR AR2EN FR2AR AR2FR Ave EN2ES ES2EN FR2ES ES2FR Ave
(XZ) (ZX) (YZ) (ZY) (XZ) (ZX) (YZ) (ZY)
RNNSearch 18.03 31.40 13.42 22.04 21.22 38.77 36.51 32.92 33.05 35.31
PBSMT 19.44 30.81 15.27 23.65 22.29 38.47 36.64 34.99 33.98 36.02
T-S 19.02 32.47 14.59 23.53 22.40 39.75 38.02 33.67 34.04 36.57
BackTrans 22.19 32.02 15.85 23.57 23.73 42.27 38.42 35.81 34.25 37.76
TA-NMT 20.59 33.22 14.64 24.45 23.23 40.85 39.06 34.52 34.39 37.21
TA-NMT(GI) 23.16 33.64 16.50 25.07 24.59 42.63 39.53 35.87 35.21 38.31
Table 3: Test BLEU on MultiUN Dataset.
Method EN2HE HE2EN FR2HE HE2FR Ave EN2RO RO2EN FR2RO RO2FR Ave
(XZ) (ZX) (YZ) (ZY) (XZ) (ZX) (YZ) (ZY)
RNNSearch 17.94 28.32 11.86 21.67 19.95 31.44 40.63 17.34 25.20 28.65
PBSMT 17.39 28.05 12.77 21.87 20.02 31.51 39.98 18.13 25.47 28.77
T-S 17.97 28.42 12.04 21.99 20.11 31.80 40.86 17.94 25.69 29.07
BackTrans 18.69 28.55 12.31 21.63 20.20 32.18 41.03 18.19 25.30 29.18
TA-NMT 19.19 29.28 12.76 22.62 20.96 33.65 41.93 18.53 26.35 30.12
TA-NMT(GI) 19.90 29.94 13.54 23.25 21.66 34.41 42.61 19.30 26.53 30.71
Table 4: Test BLEU on IWSLT Dataset.

From Table 3 on MultiUN, the performance of RNNSearch is relatively poor. As expected, PBSMT performs better than RNNSearch on the low-resource pairs by an average of 1.78 BLEU. The T-S method, which can double the training data for both (X,Z) and (Y,Z) by generating pseudo data from each other, leads to up to 1.1 BLEU points improvement on average over RNNSearch. Compared with T-S, our method gains a further improvement of about 0.9 BLEU on average, because our method can better leverage the rich-resource pair (X,Y). With extra large-scale monolingual Z introduced, BackTrans improves the performance of X ⇒ Z and Y ⇒ Z significantly compared with all the methods that use no monolingual Z. However, TA-NMT is comparable with or even better than BackTrans on Z ⇒ X and Z ⇒ Y, because both methods leverage resources from the rich-resource pair (X,Y), but BackTrans does not use the alignment information it provides. Moreover, with back-translation as good initialization, TA-NMT(GI) achieves a further improvement of about 0.7 BLEU on average over BackTrans.

In Table 4, we can draw similar conclusions. However, different from MultiUN, in the EN-FR-HE group of IWSLT, (X,Z) and (Y,Z) are severely overlapped in Z. Therefore, T-S cannot obviously improve over RNNSearch (only about 0.2 BLEU) because it fails to essentially double the training data via the teacher model. As for EN-FR-RO, with the additionally introduced EN-RO data from Europarlv7, which has no overlap in RO with FR-RO, T-S can improve the average performance more than in the EN-FR-HE group. TA-NMT outperforms T-S by 0.93 BLEU on average. Note that even though BackTrans uses extra monolingual Z, its improvements are not as obvious as on the former dataset; we will delve into the reason in the next subsection. Again, with back-translation as good initialization, TA-NMT(GI) gets the best results.

Note that the BLEU scores of TA-NMT are lower than those of BackTrans in the directions of X ⇒ Z and Y ⇒ Z. The reason is that the resources used by these two methods are different, as shown in Table 2. To do back-translation in two directions (e.g., X ⇒ Z and Z ⇒ X), we need monolingual data from both sides (e.g., X and Z); however, in TA-NMT, the monolingual data of Z is not necessary. Therefore, in the translation of X ⇒ Z or Y ⇒ Z, BackTrans uses additional monolingual data of Z while TA-NMT does not, which is why BackTrans outperforms TA-NMT in these directions. Our method can leverage back-translation as a good initialization, aka TA-NMT(GI), and then outperforms BackTrans in all translation directions.

The average test BLEU scores of different methods in each data group (EN-FR-AR, EN-FR-ES, EN-FR-HE, and EN-FR-RO) are listed in the column Ave of the tables for clear comparison.

3.4 The Effect of Extra Monolingual Data

Comparing the results of BackTrans and TA-NMT(GI) on both datasets, we notice that the improvements of both methods on IWSLT are not as significant as on MultiUN. We speculate the reason is the relatively small amount of monolingual Z we use in the experiments on IWSLT, as shown in Table 1. So we conduct the following experiment to verify this conjecture by changing the scale of the monolingual Arabic data in the MultiUN dataset, with data utilization rates set to 0%, 10%, 30%, 60% and 100% respectively. We then compare the performance of BackTrans and TA-NMT(GI) in the EN-FR-AR group. As Figure 4 shows, the amount of monolingual Z indeed has a big effect on the results, which also explains the less significant improvements of BackTrans and TA-NMT(GI) on IWSLT. In addition, even with a poor "good initialization", TA-NMT(GI) still gets the best results.

Figure 4: Test BLEU of the EN-FR-AR group for BackTrans and TA-NMT(GI) with different amounts of monolingual Arabic data.

3.5 EM Training Curves

To better illustrate the behavior of our method, we plot the training curves of both the M-steps and E-steps of TA-NMT and TA-NMT(GI) in Figure 5. The models shown in this figure are EN2AR and AR2FR on MultiUN, and EN2RO and RO2FR on IWSLT.

Figure 5: BLEU curves on validation sets during the training processes of TA-NMT and TA-NMT(GI). (Top: EN2AR (the E-step) and AR2FR (the M-step); Bottom: EN2RO (the E-step) and RO2FR (the M-step))

From Figure 5, we can see that the two low-resource translation models improve nearly simultaneously along with the training process, which verifies our point that two weak models can boost each other within our EM framework. Notice that at the early stage, the performance of all models stagnates for several iterations, especially for TA-NMT. The reason could be that the pseudo bilingual data and the true training data are heterogeneous, and it may take some time for the models to adapt to a new distribution on which both models agree. Compared with TA-NMT, TA-NMT(GI) is more stable, because its models may have already adapted to a mixed distribution of heterogeneous data in the preceding back-translation phase.

3.6 Reinforcement Learning Mechanism in Our Method

Source in concluding , poverty eradication requires political will and commitment .
Output en (0.66) conclusión (0.80) , (0.14) la (0.00) erradicación (1.00) de (0.40) la (0.00) pobreza
(0.90) requiere (0.10) voluntad (1.00) y (0.46) compromiso (0.90) políticas (-0.01) . (1.00)
Reference en conclusión , la erradicación de la pobreza necesita la voluntad y compromiso políticos .
Source visit us and get to know and love berlin !
Output visita (0.00) y (0.05) se (0.00) a (0.17) saber (0.00) y (0.04) a (0.01) berlín (0.00) ! (0.00)
Reference visítanos y llegar a saber y amar a berlín .
Source legislation also provides an important means of recognizing economic , social and cultural
rights at the domestic level .
Output la (1.00) legislación (0.34) también (1.00) constituye (0.60) un (1.00) medio (0.22) importante
(0.74) de (0.63) reconocer (0.21) los (0.01) derechos (0.01) económicos (0.03) , (0.01) sociales
(0.02) y (0.01) culturales (1.00) a (0.00) nivel (0.40) nacional (1.00) . (0.03)
Reference la legislación también constituye un medio importante de reconocer los derechos económicos ,
sociales y culturales a nivel nacional .
Table 5: English to Spanish translation sampled in the E-step as well as its time-step rewards.

As shown in Equation 9, the E-step actually works as a reinforcement learning (RL) mechanism. The models p(z|x; θ_xz) and p(z|y; θ_yz) generate samples by themselves and receive rewards to update their parameters. Note that the "reward" here is described by the log terms in Equation 9, which are derived from our EM algorithm rather than defined artificially. In Table 5, we give a case study of EN2ES translations sampled by p(z|x; θ_xz) as well as their time-step rewards during the E-step.

In the first case, the best translation of "political" is "políticos". When the model generates an inaccurate one, "políticas", it receives a negative reward (-0.01), with which the model parameters are updated accordingly. In the second case, the output misses important words and is not fluent. The rewards received by the model are zero for nearly all tokens in the output, leading to an ineffective update. In the last case, the output sentence is identical to the human reference. The rewards received are nearly all positive and meaningful, so the RL rule updates the parameters to encourage this translation candidate.
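For illustration, here is a tiny Python sketch of one possible per-token reading of this reward signal, using made-up probabilities; it reflects our interpretation of the log terms in Equation 9, not the exact normalization behind the scores reported in Table 5.

import math

# Made-up per-token probabilities for one sampled translation z_1..z_4.
p_z_given_x = [0.30, 0.20, 0.40, 0.10]   # sampling model p(z|x; theta_xz)
p_z_given_y = [0.45, 0.05, 0.60, 0.12]   # reference model p(z|y; theta_yz)

# A token is encouraged when the reference model likes it more than the
# sampling model does, i.e. when log(p(z_t|y) / p(z_t|x)) is positive.
rewards = [math.log(py / px) for px, py in zip(p_z_given_x, p_z_given_y)]
print([round(r, 2) for r in rewards])    # [0.41, -1.39, 0.41, 0.18]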

4 Related Work

NMT systems rely heavily on the availability of large bilingual data and produce poor translation quality for low-resource pairs Zoph et al. (2016). This low-resource phenomenon has been observed in much preceding work. A very common approach is exploiting monolingual data of both the source and target languages Sennrich et al. (2015); Zhang and Zong (2016); Cheng et al. (2016); Zhang et al. (2018); He et al. (2016).

As a kind of data augmentation technique, exploiting monolingual data can enrich the training data for low-resource pairs. Sennrich et al. (2015) propose back-translation, which exploits the monolingual data of the target side and generates pseudo bilingual data via an additional target-to-source translation model. Different from back-translation, Zhang and Zong (2016) propose two approaches to using source-side monolingual data, the first employing a self-learning algorithm to generate pseudo data, and the second using two NMT models to predict the translation and to reorder the source-side monolingual sentences. As an extension of these two methods, Cheng et al. (2016) and Zhang et al. (2018) combine the two translation directions and propose a training framework to jointly optimize the source-to-target and target-to-source translation models. Similar to joint training, He et al. (2016) propose a dual learning framework with a reinforcement learning mechanism to better leverage monolingual data and make the two translation models promote each other. All of these methods concentrate on exploiting the monolingual data of the source language, the target language, or both.

Our method takes a different angle but is compatible with existing approaches: we propose a novel triangular architecture to leverage two additional language pairs by introducing a third, rich language. By combining our method with existing approaches such as back-translation, we can make further improvements.

Another approach to the low-resource translation problem is multilingual neural machine translation Firat et al. (2016), in which different encoders and decoders for all languages are trained with a shared attention mechanism. This method tends to exploit the network architecture to relate low-resource pairs. Our method differs in that it is more of a training method than a network modification.

5 Conclusion

In this paper, we propose a triangular architecture (TA-NMT) to effectively tackle the problem of low-resource pair translation with a unified bidirectional EM framework. By introducing another rich language, our method can better exploit the additional language pairs to enrich the original low-resource pair. Compared with RNNSearch Bahdanau et al. (2014), a teacher-student alike method Chen et al. (2017) and back-translation Sennrich et al. (2015) at the same data level, our method achieves significant improvements on the MultiUN and IWSLT2012 datasets. Note that our method can be combined with methods exploiting monolingual data for the NMT low-resource problem, such as back-translation, to make further improvements.

In the future, we may extend our architecture to other scenarios, such as totally unsupervised training with no bilingual data for the rare language.

Acknowledgments

We thank Zhirui Zhang and Shuangzhi Wu for useful discussions. This work is supported in part by NSFC U1636210, 973 Program 2014CB340300, and NSFC 61421003.


Appendix A Implementation details

All the NMT systems we use are implemented as the classic attention-based encoder-decoder framework with a bidirectional RNN encoder Bahdanau et al. (2014). The embedding size of both source and target words is 256, and the hidden units of both the encoder and the decoder are 512-dimensional GRU cells for the MultiUN dataset and 256-dimensional for the IWSLT dataset. The vocabulary size is limited to 50K for each language in the MultiUN dataset and 30K in the IWSLT2012 dataset, with out-of-vocabulary (OOV) words mapped to a special unknown-word token. The parameters are randomly initialized by sampling from a Gaussian distribution.

We use mini-batches of size 64 with the AdaDelta optimizer Zeiler (2012) for training. The learning rate in pre-training is set to 1.0 (the gradients are normalized), while in the subsequent training stages it is set to 0.5. In the pre-training stage, we randomly shuffle the given data and train the models for 20 to 30 epochs until convergence. At test time, beam search is used for decoding, with the beam size set to 8.
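For convenience, the hyperparameters reported above are collected into a single Python configuration dictionary below (an illustrative summary, not the authors' code).

# Hyperparameters from Appendix A gathered in one place (illustrative only).
CONFIG = {
    "embedding_size": 256,              # source and target word embeddings
    "hidden_size": {"MultiUN": 512,     # GRU hidden units per dataset
                    "IWSLT2012": 256},
    "vocab_size": {"MultiUN": 50_000,   # per-language vocabulary limits
                   "IWSLT2012": 30_000},
    "batch_size": 64,
    "optimizer": "AdaDelta",
    "learning_rate": {"pretrain": 1.0,  # gradients are normalized
                      "ta_nmt": 0.5},
    "pretrain_epochs": (20, 30),
    "beam_size": 8,
}

if __name__ == "__main__":
    for key, value in CONFIG.items():
        print(f"{key}: {value}")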