Learn to Use Future Information in Simultaneous Translation

07/10/2020 ∙ by Xueqing Wu, et al. ∙ Microsoft

Simultaneous neural machine translation (briefly, NMT) has attracted much attention recently. In contrast to standard NMT, where the NMT system can utilize the full input sentence, simultaneous NMT is formulated as a prefix-to-prefix problem, where the system can only utilize the prefix of the input sentence and more uncertainty is introduced to decoding. Wait-k is a simple yet effective strategy for simultaneous NMT, where the decoder generates the output sequence k words behind the input words. We observed that training simultaneous NMT systems with future information (i.e., trained with a larger k) generally outperforms the standard ones (i.e., trained with the given k). Based on this observation, we propose a framework that automatically learns how much future information to use in training for simultaneous NMT. We first build a series of tasks where each one is associated with a different k, and then learn a model on these tasks guided by a controller. The controller is jointly trained with the translation model through bi-level optimization. We conduct experiments on four datasets to demonstrate the effectiveness of our method.




1 Introduction

Simultaneous translation (also known as simultaneous interpretation) is widely used in international conferences, summits and business. Different from standard neural machine translation (NMT) wu2016google ; hassan2018achieving , simultaneous NMT has a stricter requirement for latency: we cannot wait until the end of a source sentence but have to start translating right after reading the first few words. That is, the translator is required to provide instant translation based on a partial source sentence.

Simultaneous NMT is formulated as a prefix-to-prefix problem ma2019stacl ; xiong2019dutongchuan ; ma2020monotonic , where a prefix refers to a sub-sequence starting from the beginning of the sentence to be translated. In simultaneous NMT, we face more uncertainty than in conventional NMT, since the translation starts from a partial source sentence rather than the complete one. Wait-k ma2019stacl is a simple yet effective strategy for simultaneous NMT, where the generated translation is k words behind the source input. That is, rather than instantly translating each word, wait-k actually leverages k future words. Obviously, a larger k leverages more future information and therefore results in better translation quality, but at the cost of larger latency. Thus, when used in real-world applications, k should be kept relatively small for simultaneous NMT.

While only small k values are allowed in inference, we observe that training with a larger k leads to better accuracy for wait-k inference, as demonstrated in Figure 1, in which a wait-3 model is required for English→German translation. Training directly with wait-3 gives a certain BLEU score, but if we train with wait-k' where k' is set to a larger value and test with wait-3, we obtain better BLEU scores. Despite the mismatch between training with wait-k' and testing with wait-3, the model benefits from the availability of more future information. This is consistent with the observation in ma2019stacl .

Figure 1: Preliminary exploration of IWSLT English-to-German simultaneous NMT. The x-axis represents the waiting threshold during training and the y-axis represents the BLEU scores when testing with the wait-3 strategy.

Here, the challenge is how much future information we should use. As shown in Figure 1, using more future information does not monotonically improve the translation accuracy of wait-k inference, mainly because more future information enlarges the gap between training and inference. In this work, we propose a framework that can automatically determine how much future information to use in training for simultaneous NMT. Given a pre-defined k for inference, we prepare training tasks wait-k' with different k' values. We introduce a controller such that, given a training sample, the controller can dynamically select one of these tasks so as to maximize the validation performance on wait-k, i.e., the setting we are interested in. The task selection is based on the data itself and the network status of the translation model. The controller model and the translation model are jointly learned, where the learning process is formulated as a bi-level optimization problem, and we design an effective algorithm to solve it. We conduct experiments on four datasets to verify the effectiveness of our method.

The remainder of this paper is organized as follows. Related work is introduced in Section 2, the problem formulation and background are introduced in Section 3, and our method is introduced in Section 4. The experiments and analysis are in Section 5, and we discuss the conclusion and future work in Section 6.

2 Related work

We first introduce related work on simultaneous NMT, and then briefly summarize work on leveraging future information.

Related work on simultaneous NMT can be categorized by whether a fixed decoding scheduler or an adaptive one is used. Wait-k is the representative method with a fixed scheduler ma2019stacl , where the decoding is always k words behind the source input. Although the method is simple, it achieves surprisingly good results in terms of translation quality and controllable latency, and has been extended to speech-related simultaneous translation zhang2019simuls2s ; ren2020simulSpeech . A similar idea exists in dalvi2018incremental , which used rule-based schedulers. Among methods that use adaptive schedulers, zheng2020simultaneous leveraged a collection of wait-k models with different waiting thresholds and designed a heuristic rule to adaptively determine which wait-k to use. Monotonic Infinite Lookback Attention (MILk) leveraged monotonic attention as an end-to-end learnable adaptive scheduler arivazhagan-etal-2019-monotonic . Multihead Monotonic Attention (MMA) extended the idea to multihead attention and proposed two mechanisms: MMA-IL (Infinite Lookback), which has higher translation quality, and MMA-H(ard), which is more computationally efficient ma2020monotonic . zheng-etal-2019-simultaneous applied imitation learning to simultaneous NMT and designed a restricted dynamic oracle. zheng-etal-2019-simpler proposed another oracle generated by a conventional NMT teacher according to predefined rules.

Action prediction is another typical application of leveraging future information 8543243 ; cai2019action ; 8099873 . The task is, given an action video recorded as a series of frames, to predict the action as early as possible (i.e., leveraging partial information only). A common practice in the above work is to first learn features on the complete video, and then distill them into the partial-information predictor. Leveraging future information is also studied in game AI, like Suphx li2020suphx and AlphaStar vinyals2019grandmaster . The success of the above applications suggests that leveraging future information has great potential to improve performance.

3 Problem formulation and background

In this section, we first introduce the notations used in this work, followed by the formulation of the wait-k strategy, and then we introduce our network architecture adapted from ma2019stacl .

3.1 Notations and formulation

Let X and Y denote the source language domain and target language domain. For any x ∈ X and y ∈ Y, let x_i and y_i denote the i-th tokens in x and y respectively, and let |x| and |y| denote the numbers of tokens in x and y. Let x_{≤i} denote a prefix of x, which is the subsequence (x_1, ..., x_i), and similarly for y_{≤i}. Let D_train and D_val denote the training and validation sets, both of which are collections of bilingual sentence pairs.

The wait-k strategy ma2019stacl is defined as follows: given an input x, the generation of the translation is always k tokens behind reading x. That is, at the t-th decoding step, we generate token y_t based on x_{≤k+t-1} (more strictly, x_{≤min(k+t-1, |x|)}). Our goal is to obtain a model with parameters θ that achieves better results with wait-k.
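As a concrete illustration (our own sketch, not the authors' code), the wait-k read/write schedule g(t) = min(k + t - 1, |x|) can be written as:

```python
# Illustrative sketch of the wait-k schedule: g(t) is the number of
# source tokens read before emitting target token t. Function and
# variable names here are our own, not from the paper's code.

def wait_k_schedule(k, src_len, tgt_len):
    """Return [g(1), ..., g(tgt_len)] under the wait-k policy."""
    return [min(k + t - 1, src_len) for t in range(1, tgt_len + 1)]

# With k = 3 and a 6-token source, the decoder first reads 3 tokens,
# then alternates read/write until the source is exhausted.
print(wait_k_schedule(3, 6, 6))  # [3, 4, 5, 6, 6, 6]
```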

3.2 Model architecture

Our model for simultaneous NMT is based on the Transformer model vaswani2017attention . The model includes an encoder and a decoder, which are used for incrementally processing the source and target sentences respectively. Both the encoder and decoder are stacks of blocks. We mainly introduce the differences compared to the standard Transformer.

(1) Incremental encoding: Let h_i^l denote the output of the i-th position from block l. For ease of reference, let h_{≤i}^l denote (h_1^l, ..., h_i^l), and let h_i^0 denote the embedding of the i-th token. An attention model, attn(q, K, V), takes a query q (of dimension d), a set of keys K and values V as inputs. K and V are of equal size, and k_i and v_i are the i-th key and value. attn is defined as follows:

attn(q, K, V) = Σ_i α_i (W_V v_i),  with  α_i = exp( (W_Q q)ᵀ (W_K k_i) / √d ) / Σ_j exp( (W_Q q)ᵀ (W_K k_j) / √d ),

where the W's are the parameters to be optimized. On the encoder side, the hidden representations are obtained in a unidirectional way:

h_i^l = attn(h_i^{l-1}, h_{≤i}^{l-1}, h_{≤i}^{l-1}).

That is, a hidden representation can only attend to the previously generated hidden representations, so each source token is encoded exactly once and never needs to be recomputed as the prefix grows. In comparison, ma2019stacl still leverages bidirectional attention, which must re-encode the whole prefix each time a new source token is read, and is therefore asymptotically more expensive. We find that unidirectional attention is much more efficient than bidirectional attention without much accuracy drop (see Appendix D.1 for details).
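The unidirectional masking above can be sketched as follows; this is a simplified single-head version with the projection matrices omitted, purely for illustration:

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head attention where position i attends to keys 0..i only.
    q, k, v: arrays of shape (n, d); the W projections are omitted for brevity."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)               # (n, n) dot-product scores
    mask = np.triu(np.ones((n, n)), k=1)        # 1 strictly above the diagonal
    scores = np.where(mask == 1, -1e9, scores)  # block attention to the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Because position i never looks to the right of i, appending a new
# source token does not change the already-computed hidden states.
```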

(2) Incremental decoding: Since we use the wait-k strategy, decoding starts before reading all inputs. At the t-th decoding step, the decoder can only read x_{≤k+t-1}. When k + t - 1 < |x|, the decoder greedily generates one token at each step, i.e., the token is argmax over the target vocabulary of P(y_t | x_{≤k+t-1}, y_{<t}). When k + t - 1 ≥ |x|, the model has read the full input sentence and can generate the remaining words using beam search.
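A minimal sketch of the greedy phase of this loop is given below. The model interface `model.predict(src_prefix, tgt_prefix)`, returning a next-token distribution, is our own assumption, and the beam-search tail is omitted:

```python
def wait_k_decode(model, src, k, eos, max_len=100):
    """Greedy wait-k decoding: at step t, only src[:k+t-1] is visible.
    `model.predict` is a hypothetical interface, not the paper's API."""
    tgt = []
    t = 1
    while len(tgt) < max_len:
        g = min(k + t - 1, len(src))         # source tokens read so far
        probs = model.predict(src[:g], tgt)  # condition on the prefix only
        nxt = max(range(len(probs)), key=probs.__getitem__)
        if nxt == eos:
            break
        tgt.append(nxt)
        t += 1
    return tgt
```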

4 Our method

We first introduce our algorithm for leveraging future information via bi-level optimization. Then we discuss its relationship with several other heuristic algorithms that leverage future information.

4.1 Algorithm

We introduce a task controller φ parameterized by ω, which adaptively assigns the current input a task wait-k', where k' ∈ {1, 2, ..., K} and K is a pre-defined hyperparameter. The input of φ consists of two parts: (i) information about the data; (ii) information about the network state (e.g., historical losses, previous actions). For ease of reference, denote these inputs as s. We will discuss how to design s in Section 5.1.

Denote by M(D, θ; k) the validation metric of the wait-k strategy, evaluated on data D with model θ. Our idea is formulated as a bi-level optimization problem. That is,

max_ω M(D_val, θ*(ω); k)   s.t.   θ*(ω) = argmin_θ Σ_{(x,y) ∈ D_train} ℓ(x, y; θ, wait-φ(s; ω)),   (3)

where ℓ is the training loss of the pair (x, y) under the wait-k' task assigned by the controller. In Eqn.(3), we can see that we learn a translation model θ under the guidance of the controller φ. The goal of the controller is to maximize the validation performance using wait-k, and it adaptively assigns a wait-k' task to the input, by which the student model can leverage more information, especially future information.

We optimize Eqn.(3) in an alternating way: we first optimize θ with a given ω, then update ω using the REINFORCE algorithm. We repeat the above process until convergence. Details can be found in Algorithm 1.

1 Input: Number of training episodes E; internal update iterations T; batch size B; learning rate of the NMT model; learning rate of the controller; initial parameters θ_0 and ω_0;
2 for e = 1, 2, ..., E do
3       Initialize θ and ω from the previous episode;
4       for t = 1, 2, ..., T do
5             Randomly sample a mini-batch of data from D_train; assign each data point a wait-k' task: k' = φ(s; ω);
6             Update θ by one gradient step on the assigned tasks;
7       Calculate the validation performance R = M(D_val, θ; k);
8       Update the controller ω by REINFORCE with reward R.
Return θ and ω.
Algorithm 1 The optimization algorithm.

Algorithm 1 is executed for E episodes (i.e., the outer loop), and each episode consists of a T-step inner loop. The inner loop (from line 4 to line 6) aims to optimize θ, where we can update the parameters with any gradient-based algorithm like momentum SGD, Adam adam_optimizer , etc. The outer loop aims to optimize ω. φ can be regarded as a policy network, where the state is s, the action is the choice of the task wait-k' with k' ∈ {1, 2, ..., K}, and the reward is the validation performance (line 7). At the end of each episode, we update ω using the REINFORCE algorithm (line 8).
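The outer-loop controller update can be sketched as a plain REINFORCE step over task-sampling logits. This is a deliberately minimal stand-in (the actual controller is an MLP over the features of Section 5.1):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce_update(logits, actions, reward, lr=0.1):
    """One REINFORCE step: logits += lr * reward * d log pi(a)/d logits,
    accumulated over the actions sampled during the episode."""
    new = list(logits)
    for a in actions:
        p = softmax(new)
        for i in range(len(new)):
            grad = (1.0 if i == a else 0.0) - p[i]  # gradient of log pi(a)
            new[i] += lr * reward * grad
    return new
```

A positive reward increases the probability of the sampled wait-k' tasks; a negative (baseline-subtracted) reward decreases it.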

4.2 Discussion

Under our framework, where we adaptively assign a task wait-k' to the input, there are several heuristic alternatives:

(1) Random sampling (briefly, Random): When an input comes, randomly choose k' from {1, 2, ..., K} with equal probability;

(2) Curriculum learning (briefly, CL): We gradually decrease k' from K to the threshold k we will use in the test setting. There are several ways to decrease k', e.g., ladder-like, conic, logarithmic, etc. (see Appendix A.2 for mathematical definitions).

There are some limitations to the above two strategies. Random always explores all possible k' values, even if some wait-k' is certainly not a good strategy. For CL, we need to manually design when to decrease k', which is a challenging problem. We overcome both problems by introducing a controller, which adaptively determines how much exploration the model requires and how long a specific wait-k' strategy should be used.

5 Experiments

We work on text-to-text simultaneous NMT in this paper and leave the speech-to-speech version for future work. We conduct experiments on three small-scale IWSLT datasets: IWSLT'14 English→German, IWSLT'15 English→Vietnamese and IWSLT'17 English→Chinese, and a large-scale dataset: WMT'15 English→German translation. Briefly denote English, German, Vietnamese and Chinese as En, De, Vi and Zh respectively.

5.1 Settings

Datasets: For IWSLT'14 En→De, following edunov-etal-2018-classical , we lowercase all words, tokenize them and apply BPE sennrich-etal-2016-neural jointly to the source and target sequences. We hold out part of the training corpus for validation and use the remaining sequences as the training set. The test set is the concatenation of tst2010, tst2011, tst2012, dev2010 and dev2012. For IWSLT'15 En→Vi, following ma2020monotonic , we use tst2012 as the validation set and tst2013 as the test set. For IWSLT'17 En→Zh, we tokenize the data and apply BPE independently to the source and target sequences. We concatenate tst2013, tst2014 and tst2015 as the validation set and use tst2017 as the test set. For WMT'15 En→De, following ma2019stacl ; arivazhagan-etal-2019-monotonic , we tokenize the data and apply BPE jointly to the source and target sentences. We use newstest2013 as the validation set and newstest2015 as the test set. More details about the datasets can be found in Appendix B.

Models: The translation model is based on the Transformer. For IWSLT En→Zh and En→Vi, we use the transformer small setting; for IWSLT En→De, we use the same architecture but with a different embedding dimension; for WMT'15 En→De, we use the transformer big setting. The controller φ for each task is a multilayer perceptron (MLP) with one hidden layer and the tanh activation function.

Input features of φ: The input s is a vector containing: (1) the ratios between the lengths of the source/target sentences and the average source/target sentence lengths over all training data (two dimensions); (2) the training loss evaluated with the assigned wait-k'; (3) the average of historical training losses; (4) the validation loss of the previous epoch; (5) the average of historical validation losses; (6) the ratio of the current training step to the total number of training iterations.
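The feature vector above might be assembled as follows; the ordering and grouping are illustrative assumptions on our part:

```python
def controller_features(src_len, tgt_len, avg_src_len, avg_tgt_len,
                        train_loss, hist_train_losses,
                        prev_val_loss, hist_val_losses,
                        step, total_steps):
    """Build the controller input s from data and network-state statistics."""
    return [
        src_len / avg_src_len,                            # (1) length ratios
        tgt_len / avg_tgt_len,
        train_loss,                                       # (2) current wait-k' loss
        sum(hist_train_losses) / len(hist_train_losses),  # (3) avg train loss
        prev_val_loss,                                    # (4) previous val loss
        sum(hist_val_losses) / len(hist_val_losses),      # (5) avg val loss
        step / total_steps,                               # (6) training progress
    ]
```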

Training strategy: For the translation model, we use the Adam adam_optimizer optimizer with the inverse_sqrt learning-rate scheduler (see Section 5.3 of vaswani2017attention for details). The batch size and the number of GPUs differ across IWSLT En→De, En→Vi and WMT'15 En→De. For IWSLT tasks, the controller learning rate is grid searched with the vanilla SGD optimizer, and the internal update iteration T is grid searched as a multiple of the number of updates in one epoch of student-model training. For WMT'15 En→De, the student model is warm started from a pretrained wait-k' model with fixed learning rate and internal update iteration.

The validation performance is the inverse of the validation loss under the wait-k strategy. To stabilize training, we subtract a baseline from the reward in line 7 of Algorithm 1. The baseline is the validation performance of the previous episode; that is, the validation signal at episode e is R_e − R_{e−1}, where R_0 is the inverse validation loss of the randomly initialized model.
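Interpreting "inverse" as negation (an assumption on our part), the baseline-subtracted validation signal can be sketched as:

```python
def shaped_reward(val_losses, episode):
    """val_losses[e]: wait-k validation loss after episode e
    (val_losses[0] is that of the randomly initialized model).
    Reward = improvement of the negated validation loss over the
    previous episode; the negation convention is an assumption."""
    return -val_losses[episode] - (-val_losses[episode - 1])
```

The controller is thus rewarded for improving the validation loss rather than for its absolute level.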

Baselines: We implement Random and CL from Section 4.2 as baselines. We design another baseline where we train all the wait-k' strategies, k' ∈ {1, 2, ..., K}, select the best model according to validation performance and use it for test-time wait-k. The waiting threshold used to train this best model is denoted as k*, and this baseline is denoted as wait-k/k*.

Evaluation: We use BLEU scores to measure translation quality, and use Average Proportion (AP) and Average Lagging (AL) to evaluate translation delay. Specifically, AP measures the average proportion of source symbols required for translation, but it is not directly comparable between long and short sentences; AL measures the average number of words the output lags behind the input and overcomes this shortcoming of AP (see Appendix A.1 for details). Following common practice ma2019stacl ; ma2020monotonic , we show the BLEU-AP and BLEU-AL curves to demonstrate the tradeoff between quality and latency. For IWSLT'14 En→De and IWSLT'15 En→Vi, we use multi-bleu.perl to evaluate the BLEU scores; for IWSLT'17 En→Zh and WMT'15 En→De, we use sacreBLEU to evaluate detokenized BLEU scores. We use the scripts provided by ma2019stacl to evaluate the AP and AL scores.
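For reference, the AP/AL computations as standardly defined in the simultaneous-translation literature can be sketched as follows (a hedged sketch; the paper's own equations are given in Appendix A.1):

```python
def average_proportion(g, src_len, tgt_len):
    """AP = (1 / (|x| |y|)) * sum_t g(t)."""
    return sum(g) / (src_len * tgt_len)

def average_lagging(g, src_len, tgt_len):
    """AL = (1 / tau) * sum_{t <= tau} [g(t) - (t - 1) / r],
    with r = |y| / |x| and tau the first step where g(t) = |x|."""
    r = tgt_len / src_len
    tau = next(t for t, gt in enumerate(g, start=1) if gt >= src_len)
    return sum(g[t - 1] - (t - 1) / r for t in range(1, tau + 1)) / tau

# Sanity check: for wait-3 with |x| = |y| = 6, g = [3, 4, 5, 6, 6, 6],
# AP = 30/36 and AL = 3.0 (the lag equals k when the lengths match).
```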

5.2 Results

We first compare our method with the baseline methods on the IWSLT datasets. The BLEU-latency curves are shown in Figure 2, and we report the BLEU scores of En→Vi under different test-time wait-k in Table 1. The BLEU scores for all languages are in Appendix C.

(a) BLEU-AP, En→De.
(b) BLEU-AP, En→Vi.
(c) BLEU-AP, En→Zh.
(d) BLEU-AL, En→De.
(e) BLEU-AL, En→Vi.
(f) BLEU-AL, En→Zh.
Figure 2: Translation quality against latency metrics (AP and AL) on the IWSLT'14 En→De, IWSLT'15 En→Vi and IWSLT'17 En→Zh tasks.
Test k   wait-k   wait-k/k*   CL   Random   Ours
Table 1: BLEU scores on the IWSLT En→Vi simultaneous NMT task.

We have the following observations:

(1) Generally, our method consistently performs the best across different translation tasks in terms of both translation quality and controllable latency. As shown in Table 1, our method achieves the highest BLEU scores among all baselines. In Figure 2, the curve for our method (i.e., the red one) is on top in most cases, which indicates that at a given latency (e.g., AP or AL), we achieve the best translation quality.

(2) Baselines like wait-k/k*, Random and CL can also outperform the vanilla wait-k, which demonstrates the effectiveness of leveraging future information. However, they are not consistent in which one is better. For example, in Figure 2, wait-k/k* performed best on En→Vi but not well on En→De. CL performs similarly to Random on the En→De dataset, but much better than Random on En→Vi and especially on the En→Zh dataset. In comparison, with our method, which is guided by a controller, the improvement is much more consistent.

(3) The improvement brought by our method is more significant with smaller k's than with bigger k's. We observe that all baselines perform well with bigger k, where more information is available during inference, so the advantages of leveraging future information are less significant.

We further compare our method with MILk arivazhagan-etal-2019-monotonic , MMA-IL ma2020monotonic and MMA-H ma2020monotonic on IWSLT En→Vi. The BLEU-AL curves are shown in Figure 3 and the BLEU-AP curves are in Appendix C. When AL is small, our method outperforms all baseline models; when AL is large, our method performs slightly worse than MMA-IL and MMA-H. We will combine our method with MMA-IL and MMA-H in the future.

Figure 3: BLEU-AL comparison between our method and baselines on IWSLT'15 En→Vi.
Figure 4: Translation quality against latency metrics (AP and AL) on WMT'15 En→De.

The results on WMT'15 En→De, whose training corpus is larger, are shown in Figure 4. Due to resource limitations, we only compared our method with the wait-k and wait-k/k* baselines. Our method consistently outperforms both, which demonstrates that it also improves performance on large datasets.

5.3 Analysis

(I) Strategy analysis: We visualize our learned strategies for En→Zh wait-3 and wait-9 translation in Figure 5. We show the frequency of each wait-k' strategy that the controller outputs at several episodes over the course of training.

Figure 5: An illustration of the strategies for wait-3 and wait-9 on the En→Zh dataset.

We observed that the controller samples different k' roughly uniformly at first, and then the strategies converge within a small number of episodes. After convergence, each controller mainly samples a few specific actions, and the action both controllers prefer most is a large k'. Generally, the two learned strategies assign most of the sampling frequency to large k', which again shows the importance of using future information. However, it is worth noting that the controllers also sample smaller k', which means that past information is utilized as well. Our conjecture is that the use of past information helps mitigate the mismatch between training and testing: if the model is always trained with future information, this mismatch will be large.

(II) Action space selection: In the previous experiments, both future information and past information are leveraged; that is, given a test-time strategy wait-k, the controller can sample a strategy wait-k' with k' < k or k' > k. We want to study the effect of using past information or future information only. For any wait-k, we build another two action spaces for the controller: a past-only space {1, ..., k} and a future-only space {k, ..., K}. We evaluate wait-k on IWSLT'14 En→De with the above two action spaces. The results are reported in Table 2.

We observe that our method with the full action space significantly outperforms the past-only variant and slightly outperforms the future-only variant. This shows that leveraging both kinds of information helps improve performance.

Full action space
Table 2: Ablation study for action space selection on the IWSLT'14 En→De dataset.

(III) Feature selection: To emphasize the importance of the features selected in Section 5.1, we provide four groups of ablation studies, where in each group specific features are excluded: (i) source and target sentence lengths; (ii) current training loss and average historical training loss; (iii) current validation loss and average historical validation loss; (iv) training step. We work on the IWSLT'14 En→De task and study the effect under three test-time wait-k thresholds.

The results are shown in Table 3. We report the BLEU scores only, since the latency metrics (AP and AL) are not significantly influenced. Removing any feature causes a performance drop, indicating that they all contribute to the decision making. Specifically, information about the network state (i.e., feature groups (iii) and (iv)) is more important to the decision making than information about the input data (i.e., feature groups (i) and (ii)).

Ours 23.91 26.27 26.97
- (i) 23.67 (-1.00%, rank 3) 26.03 (-0.91%, rank 3) 26.92 (-0.19%, rank 4)
- (ii) 23.70 (-0.88%, rank 4) 26.04 (-0.88%, rank 4) 26.91 (-0.22%, rank 3)
- (iii) 23.57 (-1.42%, rank 1) 25.92 (-1.33%, rank 2) 26.72 (-0.93%, rank 1)
- (iv) 23.65 (-1.09%, rank 2) 25.63 (-2.44%, rank 1) 26.86 (-0.41%, rank 2)
Table 3: Ablation study for feature selection on the IWSLT'14 En→De dataset.

6 Conclusion and future work

In this work, we propose a new approach for simultaneous NMT. Motivated by the fact that wait-k benefits from future information, we introduce a controller which adaptively assigns a task wait-k' to the input. A bi-level optimization method is leveraged to jointly obtain the translation model and the controller. Experiments on four translation tasks demonstrate the effectiveness of our approach.

For future work, there are many interesting directions. First, we will enhance the objective function in Eqn.(3) beyond using translation quality only and explicitly introduce a latency constraint. Second, we will combine our method with adaptive decoding methods arivazhagan-etal-2019-monotonic ; ma2020monotonic . Third, we will apply the idea in this work to more applications like action prediction, weather forecasting, game AI, etc.


Appendix A Mathematical definitions

A.1 Latency metrics definitions

Given the input sentence x and the output sentence y, let |x| and |y| denote the lengths of x and y respectively. Define a function g(t) of decoding step t, which denotes the number of source tokens processed by the encoder when deciding the target token y_t. For the wait-k strategy, g(t) = min(k + t - 1, |x|). The definitions of Average Proportion (AP) and Average Lagging (AL) are listed in Eqn.(4) and Eqn.(5):

AP = (1 / (|x| |y|)) Σ_{t=1}^{|y|} g(t),   (4)

AL = (1 / τ) Σ_{t=1}^{τ} ( g(t) − (t − 1) / r ),   (5)

where r = |y| / |x| and τ is the first decoding step at which g(t) = |x|.
A.2 Mathematical formulation of curriculum learning

In the curriculum learning (briefly, CL) baseline, we gradually decrease k' from K to the threshold k which will be used in the test setting. There are several ways to decrease k', including ladder-like, conic and logarithmic schedules. The mathematical formulations are shown as follows:

Ladder-like: (6)
Conic: (7)
Logarithmic: (8)

where T_total denotes the total update number, t denotes the current update number (0 ≤ t ≤ T_total), and a predefined hyperparameter controls the shape of the k'-t curve. The k'-t curves are shown in Figure 6. We use ladder-like CL in our experiments.
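As one plausible instantiation of the ladder-like schedule (the function and the equal-length staging are our assumptions, not the paper's exact formula), k' can be stepped from K down to the test-time k:

```python
def ladder_like_k(step, total_steps, k_max, k_test):
    """Piecewise-constant (ladder-like) decay of the training threshold k'.
    An illustrative schedule, not the paper's exact Eqn.(6)."""
    n_stages = k_max - k_test + 1
    stage = min(step * n_stages // total_steps, n_stages - 1)
    return k_max - stage

print([ladder_like_k(s, 10, 7, 3) for s in range(10)])
# [7, 7, 6, 6, 5, 5, 4, 4, 3, 3]
```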

Figure 6: The k'-t curves for ladder-like, conic and logarithmic CL.

Appendix B Detailed introduction of the datasets

For IWSLT'14 En→De, following edunov-etal-2018-classical , we lowercase all words, tokenize them and apply BPE jointly to the source and target sequences. We hold out part of the training corpus for validation and use the remaining sequences as the training set. The test set is the concatenation of tst2010, tst2011, tst2012, dev2010 and dev2012.

For IWSLT'15 En→Vi, following ma2020monotonic , we tokenize the data and replace words with frequency less than 5 by <unk>. The data is downloaded from https://nlp.stanford.edu/projects/nmt/, which has been tokenized. We use tst2012 as the validation set and tst2013 as the test set.

For IWSLT'17 En→Zh, we tokenize the data and apply BPE independently to the source and target sequences. The Chinese sentences are tokenized using Jieba ( https://github.com/fxsjy/jieba ). We use the concatenation of tst2013, tst2014 and tst2015 as the validation set and tst2017 as the test set. For WMT'15 En→De, we follow the setting in ma2019stacl ; arivazhagan-etal-2019-monotonic : we tokenize the data and apply BPE jointly to the source and target sentences. We use newstest2013 as the validation set and newstest2015 as the test set.

Appendix C Supplemental results

In this section, we report the specific BLEU scores of our method and the baselines. The BLEU scores for the IWSLT tasks are reported in Table 4, and the BLEU scores for WMT En→De are reported in Table 5. We also report the BLEU-AP curves of our method and the baselines on IWSLT'15 En→Vi in Figure 7.

Task   wait-k   wait-k/k* (best)   CL   Random   Ours
EnDe () /
EnDe () /
EnDe () /
EnDe () /
EnDe () /
EnVi () /
EnVi () /
EnVi () /
EnVi () /
EnVi () /
EnZh () /
EnZh () /
EnZh () /
EnZh () /
EnZh () /
Table 4: BLEU scores on IWSLT simultaneous NMT tasks.
wait-k   wait-k/k*   Ours
Table 5: Results on WMT EnDe dataset.
Figure 7: BLEU-AP comparison between our method and baselines on IWSLT'15 En→Vi.

Appendix D Additional ablations and analysis

D.1 Model architecture selection

As mentioned in Section 3 of the main content, we adopt unidirectional attention instead of bidirectional attention on the encoder side. We compare the performance of the wait-k model with the two attention types on the IWSLT'14 En→De dataset; the results are in Figure 8(a) and Figure 8(b). We also compare our results on WMT'15 En→De with the results of the bidirectional attention models reported by ma2019stacl ; these results are shown in Figure 8(c). On IWSLT'14, we observe that the performance of wait-k with unidirectional attention drops slightly relative to bidirectional attention. On the WMT'15 En→De dataset, our implementation of wait-k with unidirectional attention is slightly better than the bidirectional attention results reported in ma2019stacl . However, the computational cost of bidirectional attention is much larger than that of unidirectional attention: the unidirectional wait-k model decodes substantially more sentences per second than the bidirectional one.

(a) BLEU-AP, IWSLT'14 En→De
(b) BLEU-AL, IWSLT'14 En→De
(c) BLEU-AL, WMT'15 En→De
Figure 8: Ablation study of different model architectures on the IWSLT'14 En→De and WMT'15 En→De datasets.

D.2 Case study

To analyze the effect of using future information, we present two translation examples for En→Zh wait-3 translation in Table 6 and Table 7. We observe that all methods tend to anticipate when future information is lacking (Table 6). Wait-3 makes more mistakes (Table 6) and even makes wrong anticipations where there is no need to anticipate (Table 7), while wait-k/k* and Ours anticipate more appropriately (Table 6). However, as in Table 7, wait-k/k* sometimes generates repeated information, thereby increasing the overall latency. This might result from the gap between training and testing, as wait-k/k* is trained with a larger threshold. Our method can leverage the advantages of both, and produces translations with the best quality.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
I was born with epi@@ le@@ p@@ sy and an intellectual disability .
wait-3 出生 一个 充满 癫@@ 知识@@ 产@@ 障碍 国家
I was born in a full of epilepsy - Not a word - country .
I was born in a country full of epilepsy Not a word.
Wait-k/k* 出生 时候 患有 癫@@ 智力 障碍
I was born - when , I suffered from epilepsy and intellectual disability .
When I was born, I suffered from epilepsy and intellectual disability.
Ours 出生 伴随 癫@@ 智力 障碍
I was born when , with - epilepsy and intellectual disability .
When I was born, I was accompanied by epilepsy and intellectual disability.
Table 6: Example for En→Zh wait-3 translation. In this example and the next, different colors represent different meanings. Specifically, green and red represent information that does not exist in the source sentence (i.e., anticipated by the model), where green represents information that is consistent with the input sentence (i.e., correctly anticipated), and red represents information that is inconsistent with the input sentence (i.e., wrongly anticipated).
At this step, wait-3 anticipates "在一个" (in a), while wait-k/k* and Ours each anticipate a phrase meaning "when" (e.g., "的时候"). The anticipations generated by wait-k/k* and Ours are more appropriate within the context, while wait-3 makes a mistake.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
And I opened up the website , and there was my face staring right back at me .
wait-3 打开 网站 发现
I opened - the website , I found - I POS face .
I opened the website and I found my face.
Wait-k/k* 打开 网站 打开 网站 看着
I opened - the website , opened the website , I POS face PROG PROG looking at me .
I opened the website, opened the website, and my face was looking at me.
Ours 打开 网站 然后
I opened - the website , then - there was - I POS face stare PROG me .
I opened the website, and then there was my face staring at me.
Table 7: Example 2 for En→Zh wait-3 translation, where POS indicates possessive forms and PROG indicates progressive tense. In this example, there is no need to anticipate. However, wait-3 still anticipates "发现" (found) and makes a mistake. Wait-k/k* makes a mistake by repeating "打开了网站" (opened the website). Ours generates the best translation.