Simultaneous translation (also known as simultaneous interpretation) is widely used in international conferences, summits, and business. Different from standard neural machine translation (NMT) wu2016google ; hassan2018achieving , simultaneous NMT has a stricter latency requirement. We cannot wait until the end of a source sentence; instead, we have to start translating right after reading the first few words. That is, the translator is required to provide an instant translation based on a partial source sentence.
Simultaneous NMT is formulated as a prefix-to-prefix problem ma2019stacl ; xiong2019dutongchuan ; ma2020monotonic , where a prefix refers to a sub-sequence starting from the beginning of the sentence to be translated. In simultaneous NMT, we face more uncertainty than in conventional NMT, since the translation starts from a partial source sentence rather than the complete sentence available in conventional NMT. Wait-k ma2019stacl is a simple yet effective strategy for simultaneous NMT, where the generated translation is k words behind the source input. That is, rather than instantly translating each incoming word, wait-k actually leverages future source words. Obviously, a larger k leverages more future information, and therefore results in better translation quality, but at the cost of larger latency. Thus, when used in real-world applications, we should choose a relatively small k for simultaneous NMT.
While only small values of k are acceptable at inference time, we observe that training with a larger k leads to better accuracy for wait-k inference, as demonstrated in Figure 1, in which a wait-k model is required for English→German translation. Training and testing with the same k yields a certain BLEU score, but if we train with wait-k′, where k′ is set to a larger value, and test with wait-k, we obtain better BLEU scores. Despite the mismatch between training with wait-k′ and testing with wait-k, the model benefits from the availability of more future information. This is consistent with the observation in ma2019stacl .
The challenge here is how much future information we should use. As shown in Figure 1, using more future information does not monotonically improve the translation accuracy of wait-k inference, mainly because more future information results in a larger gap between training and inference. In this work, we propose a framework that can automatically determine how much future information to use in training for simultaneous NMT. Given a pre-defined k for inference, we prepare a collection of wait-k′ training tasks with different values of k′. We introduce a controller such that, given a training sample, the controller dynamically selects one of these tasks so as to maximize the validation performance of wait-k, i.e., the strategy we are interested in. The task selection is based on the data itself and the network status of the translation model. The controller and the translation model are jointly learned; the learning process is formulated as a bi-level optimization problem, and we design an effective algorithm to solve it. We conduct experiments on four datasets to verify the effectiveness of our method.
The remainder of this paper is organized as follows. Related work is introduced in Section 2, the problem formulation and background in Section 3, and our method in Section 4. Experiments and analysis are presented in Section 5, and we discuss conclusions and future work in Section 6.
2 Related work
We first introduce related work on simultaneous NMT, and then briefly summarize work on leveraging future information.
Related work on simultaneous NMT can be categorized by whether a fixed decoding scheduler or an adaptive one is used. Wait-k is the representative method with a fixed scheduler ma2019stacl , where decoding is always k words behind the source input. Although simple, the method achieves surprisingly good results in terms of translation quality and controllable latency, and has been extended to speech-related simultaneous translation zhang2019simuls2s ; ren2020simulSpeech . A similar idea exists in dalvi2018incremental , which used rule-based schedulers. Among methods with adaptive schedulers, zheng2020simultaneous leveraged a collection of wait-k models with different waiting thresholds and designed a heuristic rule to adaptively determine which wait-k to use. Monotonic Infinite Lookback Attention (MILk) leveraged monotonic attention as an end-to-end learnable adaptive scheduler arivazhagan-etal-2019-monotonic . Multihead Monotonic Attention (MMA) extended the idea to multihead attention and proposed two mechanisms: MMA-IL (Infinite Lookback), which has higher translation quality, and MMA-H(ard), which is more computationally efficient ma2020monotonic . zheng-etal-2019-simultaneous applied imitation learning to simultaneous NMT and designed a restricted dynamic oracle. zheng-etal-2019-simpler proposed another oracle, generated by a conventional NMT teacher according to predefined rules.
Action prediction is another typical application that leverages future information 8543243 ; cai2019action ; 8099873 . The task is, given an action video recorded as a series of frames, to predict the action as early as possible (i.e., leveraging partial information only). A common practice in the above work is to first learn features on the complete video and then distill them into the partial-information predictor. Leveraging future information has also been studied in game AI, e.g., Suphx li2020suphx and AlphaStar vinyals2019grandmaster . The success of these applications suggests that leveraging future information has great potential to improve performance.
3 Problem formulation and background
In this section, we first introduce the notation used in this work, followed by the formulation of the wait-k strategy; we then introduce our network architecture, adapted from ma2019stacl .
3.1 Notations and formulation
Let X and Y denote the source and target language domains. For any x ∈ X and y ∈ Y, let x_i and y_j denote the i-th token in x and the j-th token in y respectively; |x| and |y| denote the numbers of tokens in x and y. Let x_{≤i} denote a prefix of x, i.e., the subsequence (x_1, x_2, …, x_i), and similarly y_{≤j} for y. Let D_train and D_valid denote the training and validation sets, both of which are collections of bilingual sentence pairs.
The wait-k strategy ma2019stacl is defined as follows: given an input x, the generation of the translation is always k tokens behind the reading of x. That is, at the t-th decoding step, we generate token y_t based on x_{≤ t+k−1} (more strictly, x_{≤ min(t+k−1, |x|)}). Our goal is to obtain a model with parameters θ that achieves better results with wait-k.
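The wait-k read/write schedule above can be sketched as a simple function (a minimal illustration; the function and variable names are ours, not from the paper):

```python
def waitk_schedule(t, k, src_len):
    """Number of source tokens visible when generating target token y_t
    (1-indexed decoding step t) under the wait-k strategy."""
    return min(t + k - 1, src_len)

# Example: wait-3 on a 6-token source sentence.
# Step 1 sees 3 source tokens, step 2 sees 4, ..., step 4 onward sees all 6.
visible = [waitk_schedule(t, k=3, src_len=6) for t in range(1, 7)]
```

For instance, the list `visible` above is `[3, 4, 5, 6, 6, 6]`: the decoder lags the reader by k − 1 tokens until the whole source has been read.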
3.2 Model architecture
Our model for simultaneous NMT is based on the Transformer model vaswani2017attention . The model includes an encoder and a decoder, which incrementally process the source and target sentences respectively. Both the encoder and decoder are stacks of identical blocks. We mainly introduce the differences compared to the standard Transformer.
(1) Incremental encoding: Let h_i^l denote the output of the i-th position from block l. For ease of reference, let h_{≤i}^l denote (h_1^l, …, h_i^l), and let h_i^0 denote the embedding of the i-th token. An attention model, attn(q, K, V), takes as input a query q ∈ R^d (d is the dimension of the query), a set of keys K and values V. K and V are of equal size, and k_i and v_i are the i-th key and value. attn outputs a weighted combination of the (projected) values, where the weights are obtained by a softmax over the projected query-key products and the projection matrices W are the parameters to be optimized. On the encoder side, the h_i^l are obtained in a unidirectional way: h_i^l = attn(h_i^{l-1}, h_{≤i}^{l-1}, h_{≤i}^{l-1}). That is, each hidden representation can only attend to the previously generated hidden representations, so each newly read source token is encoded once. In comparison, ma2019stacl leverages bidirectional attention, which must re-encode the available source prefix whenever a new source word arrives. We find that unidirectional attention is much more efficient than bidirectional attention without much accuracy drop (see Appendix D.1 for details).
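As a minimal illustration of such a single-query attention step over the prefix hidden states (this is a sketch, not the paper's exact parameterization; the projection matrices here are simple placeholders):

```python
import numpy as np

def attn(q, K, V, Wq, Wk, Wv):
    """Attend with query q over keys K and values V (rows = positions).
    Wq, Wk, Wv are learned projection matrices. The softmax is taken only
    over the positions present in K, so for unidirectional (incremental)
    encoding we simply pass the prefix h_{<=i} as K and V."""
    scores = (q @ Wq) @ (K @ Wk).T / np.sqrt(q.shape[-1])
    alpha = np.exp(scores - scores.max())   # numerically stable softmax
    alpha = alpha / alpha.sum()
    return alpha @ (V @ Wv)

# Incremental encoding: position i attends only to h_1..h_i, so each newly
# read source token is encoded once and earlier states are reused.
d = 4
rng = np.random.default_rng(0)
Wq = Wk = Wv = np.eye(d)                    # identity projections for illustration
h = rng.normal(size=(5, d))                 # prefix of 5 hidden states
out = attn(h[4], h[:5], h[:5], Wq, Wk, Wv)  # encode the 5th position
```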
(2) Incremental decoding: Since we use the wait-k strategy, decoding starts before the full input has been read. At the t-th decoding step, the decoder can only read x_{≤ min(t+k−1, |x|)}. While t + k − 1 < |x|, the decoder greedily generates one token per step, i.e., the token with the highest probability over the target vocabulary given the visible source prefix and the previously generated tokens. Once t + k − 1 ≥ |x|, the model has read the full input sentence and can generate the remaining words using beam search.
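The greedy phase of this decoding procedure can be sketched as follows (a simplified illustration: `score_fn` is a hypothetical stand-in for the translation model, and for brevity the sketch stays greedy even after the full source is read, whereas the actual system may switch to beam search at that point):

```python
def waitk_greedy_decode(src_tokens, k, score_fn, eos="</s>", max_len=50):
    """Greedy wait-k decoding sketch. score_fn(src_prefix, tgt_prefix)
    stands in for the translation model: it returns a dict mapping
    candidate next tokens to scores. At step t the decoder may only see
    the first min(t + k - 1, len(src_tokens)) source tokens."""
    tgt = []
    for t in range(1, max_len + 1):
        visible = min(t + k - 1, len(src_tokens))
        src_prefix = src_tokens[:visible]
        scores = score_fn(src_prefix, tgt)
        next_tok = max(scores, key=scores.get)  # greedy choice
        if next_tok == eos:
            break
        tgt.append(next_tok)
    return tgt

# Toy "model": copy the last visible source token; emit EOS after 4 tokens.
def toy_score(src_prefix, tgt):
    if len(tgt) >= 4:
        return {"</s>": 1.0}
    return {src_prefix[-1]: 1.0, "</s>": 0.0}

out = waitk_greedy_decode(["a", "b", "c", "d"], k=2, score_fn=toy_score)
```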
4 Our method
We first introduce our algorithm for leveraging future information via bi-level optimization. Then we discuss its relationship to several other heuristic algorithms that leverage future information.
We introduce a task controller φ, which adaptively assigns the current input a task wait-k′, where k′ ∈ {1, 2, …, K} and K is a pre-defined hyperparameter. The input of φ consists of two parts: (i) information about the data (x, y); (ii) information about the network state (e.g., historical losses and previous actions). For ease of reference, denote these inputs as s. We discuss how to design s in Section 5.1.
Denote by M(D; θ, k) the validation metric of the wait-k strategy, evaluated on data D with model θ. Our idea is formulated as a bi-level optimization problem. That is,

max_φ M(D_valid; θ*(φ), k)   s.t.   θ*(φ) = argmin_θ E_{(x,y) ∈ D_train, k′ ∼ φ(s)} ℓ(x, y; θ, k′),   (3)

where ℓ(x, y; θ, k′) denotes the wait-k′ training loss on sample (x, y).
From Eqn.(3), we can see that the translation model θ is learned under the guidance of the controller φ. The goal of the controller is to maximize the validation performance of wait-k, and it adaptively assigns a wait-k′ task to each input, by which the translation model can leverage more information, especially future information.
We optimize Eqn.(3) in an alternating way: we first optimize θ with a given φ, then update φ using the REINFORCE algorithm, and repeat this process until convergence. Details can be found in Algorithm 1.
Algorithm 1 is executed for a number of episodes (the outer loop), and each episode consists of a multi-step inner loop. The inner loop (lines 4 to 6) optimizes θ, where the parameters can be updated with any gradient-based algorithm such as momentum SGD, Adam adam_optimizer , etc. The outer loop optimizes φ. The controller φ can be regarded as a policy network, where the state is s, the action is the choice of the task wait-k′ with k′ ∈ {1, …, K}, and the reward is the validation performance (line 7). At the end of each episode, we update φ using the REINFORCE algorithm (line 8).
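The alternating structure of this optimization can be sketched as follows (a skeleton only, under our own naming; every callable is a stand-in for the corresponding component in the paper, and the toy instantiation at the end exists purely to make the skeleton runnable):

```python
import random

def train_with_controller(update_theta, controller_probs, update_phi,
                          validate, episodes, inner_steps, K, features):
    """Skeleton of the alternating bi-level optimization.
    controller_probs(s): probabilities over actions k' in 1..K;
    update_theta(k'):    one gradient step of the translation model
                         on a batch with the wait-k' loss;
    update_phi(traj, adv): one REINFORCE step on the controller;
    validate():          wait-k validation performance."""
    baseline = None
    for episode in range(episodes):
        trajectory = []
        for _ in range(inner_steps):            # inner loop: optimize theta
            s = features()
            probs = controller_probs(s)
            kp = random.choices(range(1, K + 1), weights=probs)[0]
            update_theta(kp)
            trajectory.append((s, kp))
        reward = validate()                     # outer loop: optimize phi
        advantage = reward - (baseline if baseline is not None else 0.0)
        update_phi(trajectory, advantage)       # REINFORCE with a baseline
        baseline = reward                       # previous-episode baseline

# Toy instantiation (all components are stand-ins, not the paper's models).
random.seed(0)
theta_updates = []
controller_probs = lambda s: [1.0] * 5          # uniform over k' in 1..5
update_theta = lambda kp: theta_updates.append(kp)
update_phi = lambda traj, adv: None
validate = lambda: 1.0
features = lambda: [0.0]

train_with_controller(update_theta, controller_probs, update_phi,
                      validate, episodes=3, inner_steps=4, K=5,
                      features=features)
```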
Under our framework, where a task wait-k′ is adaptively assigned to the input, there are also some heuristic alternatives:
(1) Random sampling (briefly, Random): when an input comes, randomly choose k′ from {1, 2, …, K} with equal probability;
(2) Curriculum learning (briefly, CL): we gradually decrease k′ from K to the threshold k that will be used in the test setting. There are several ways to decrease k′, e.g., ladder-like, conic, logarithmic, etc. (see Appendix A.2 for mathematical definitions).
Both strategies have limitations. Random always explores all possible k′, even when some wait-k′ is clearly not a good strategy. For CL, we need to manually design when to decrease k′, which is a challenging problem. We overcome both problems by introducing a controller, which adaptively determines how much exploration the model requires and how long a specific wait-k′ strategy should be used.
In this paper we work on text-to-text simultaneous NMT and leave the speech-to-speech version for future work. We conduct experiments on three small-scale IWSLT datasets, IWSLT’14 English→German, IWSLT’15 English→Vietnamese and IWSLT’17 English→Chinese, and one large-scale dataset, WMT’15 English→German. We briefly denote English, German, Vietnamese and Chinese as En, De, Vi and Zh respectively.
Datasets: For IWSLT’14 En→De, following edunov-etal-2018-classical , we lowercase all words, tokenize them, and apply BPE sennrich-etal-2016-neural jointly to the source and target sequences. We hold out part of the training corpus for validation and use the remaining sequences as the training set. The test set is the concatenation of tst2010, tst2011, tst2012, dev2010 and dev2012. For IWSLT’15 En→Vi, following ma2020monotonic , we use tst2012 as the validation set and tst2013 as the test set. For IWSLT’17 En→Zh, we tokenize the data and apply BPE independently to the source and target sequences. We concatenate tst2013, tst2014 and tst2015 as the validation set and use tst2017 as the test set. For WMT’15 En→De, following ma2019stacl ; arivazhagan-etal-2019-monotonic , we tokenize the data and apply BPE jointly to the source and target sentences. We use newstest2013 as the validation set and newstest2015 as the test set. More details about the datasets can be found in Appendix B.
Models: The translation model is based on the Transformer. For IWSLT En→Zh and En→Vi, we use the transformer small setting. For IWSLT En→De, we use the same architecture but with a different embedding dimension. For WMT’15 En→De, we use the transformer big setting. The controller φ is a multilayer perceptron (MLP) with one hidden layer and the tanh activation function.
Input features of φ: The input s is a feature vector containing: (1) the ratios of the source/target sentence lengths to the average source/target sentence lengths over all training data (two dimensions); (2) the training loss evaluated under the assigned wait-k′; (3) the average of historical training losses; (4) the validation loss of the previous epoch; (5) the average of historical validation losses; (6) the ratio of the current training step to the total number of training iterations.
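A sketch of how such a feature vector s could be assembled (the names and the exact normalizations here are our illustration, not necessarily the paper's implementation):

```python
def controller_features(src_len, tgt_len, avg_src_len, avg_tgt_len,
                        train_loss, hist_train_losses,
                        prev_valid_loss, hist_valid_losses,
                        step, total_steps):
    """Build the controller input s described above. Returns a flat list:
    length ratios (2 dims), current wait-k' training loss (1), mean
    historical training loss (1), previous-epoch validation loss (1),
    mean historical validation loss (1), training progress (1)."""
    return [
        src_len / avg_src_len,
        tgt_len / avg_tgt_len,
        train_loss,
        sum(hist_train_losses) / max(len(hist_train_losses), 1),
        prev_valid_loss,
        sum(hist_valid_losses) / max(len(hist_valid_losses), 1),
        step / total_steps,
    ]

s = controller_features(20, 22, 20, 20, 3.0, [3.5, 2.5], 2.8, [3.0],
                        step=500, total_steps=1000)
```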
Training strategy: For the translation model, we use the Adam adam_optimizer optimizer with the inverse_sqrt learning-rate scheduler (see Section 5.3 of vaswani2017attention for details). Batch sizes and the numbers of GPUs are chosen per task for IWSLT En→De, En→Vi and WMT’15 En→De. For the IWSLT tasks, the controller's learning rate is grid searched with the vanilla SGD optimizer, and the number of inner-loop updates is grid searched over multiples of the number of updates in one epoch of student-model training. For WMT’15 En→De, the student model is warm started from a pretrained wait-k model.
The validation performance R is the inverse of the validation loss under the wait-k strategy. To stabilize training, we subtract a baseline from the reward R in line 7 of Algorithm 1. The baseline at episode e is the validation performance of the previous episode, i.e., b_e = R_{e−1}, where b_0 is the inverse validation loss of the randomly initialized model. That is, the validation signal at episode e is R_e − b_e.
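This reward-with-baseline signal can be sketched as follows (an illustrative helper under our own naming; the paper computes the same quantity inside its training loop):

```python
def reward_signal(valid_losses, init_valid_loss):
    """Per-episode REINFORCE signal sketch: the reward at episode e is the
    inverse validation loss R_e = 1 / L_e, and the baseline b_e is the
    previous episode's reward (for e = 0, the inverse validation loss of
    the randomly initialized model). Returns the list of R_e - b_e."""
    rewards = [1.0 / l for l in valid_losses]
    baselines = [1.0 / init_valid_loss] + rewards[:-1]
    return [r - b for r, b in zip(rewards, baselines)]

signal = reward_signal([2.0, 1.0], init_valid_loss=4.0)
```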
Baselines: We implement Random and CL from Section 4.2 as baselines. We also design a baseline where we train all wait-k′ strategies, k′ ∈ {1, …, K}, select the best model according to validation performance, and use it for test-time wait-k. The waiting threshold used to train this best model is denoted k*, and the baseline is denoted wait-k*.
Evaluation: We use BLEU to measure translation quality, and Average Proportion (AP) and Average Lagging (AL) to evaluate translation delay. Specifically, AP measures the average proportion of source symbols required for translation, but is not directly comparable between long and short sentences; AL measures the average number of delayed words and overcomes this shortcoming of AP (see Appendix A.1 for details). Following common practice ma2019stacl ; ma2020monotonic , we show BLEU-AP and BLEU-AL curves to demonstrate the trade-off between quality and latency. For IWSLT’14 En→De and IWSLT’15 En→Vi, we use multi-bleu.perl to evaluate BLEU; for IWSLT’17 En→Zh and WMT’15 En→De, we use sacreBLEU to evaluate detokenized BLEU. We use the scripts provided by ma2019stacl to evaluate AP and AL.
We first compare our method with the baselines on the IWSLT datasets. The BLEU-latency curves are shown in Figure 2, and the BLEU scores of En→Vi under different test-time wait-k are reported in Table 1. The BLEU scores for all languages are left to Appendix C.
We have the following observations:
(1) Generally, our method consistently performs best across translation tasks in terms of both translation quality and controllable latency. As shown in Table 1, our method achieves the highest BLEU scores among all baselines. In Figure 2, the curve of our method (i.e., the red one) is on top in most cases, which indicates that at a given latency (AP or AL), we achieve the best translation quality.
(2) Baselines like wait-k*, Random and CL can also outperform the vanilla wait-k, which demonstrates the effectiveness of leveraging future information. However, no single baseline is consistently better than the others. For example, in Figure 2, wait-k* performs best on En→Vi but not well on En→De; CL performs similarly to Random on En→De, but much better than Random on En→Vi and especially on En→Zh. In comparison, with our controller-guided method, the improvement is much more consistent.
(3) The improvement brought by our method is more significant for smaller k than for larger k. We observe that all baselines perform well with larger k, where more information is available during inference, so the advantage of leveraging extra future information is less significant.
We further compare our method with MILk arivazhagan-etal-2019-monotonic , MMA-IL ma2020monotonic and MMA-H ma2020monotonic on IWSLT En→Vi. BLEU-AL curves are shown in Figure 4 and BLEU-AP curves are in Appendix C. At small AL, our method outperforms all baseline models; at large AL, it performs slightly worse than MMA-IL and MMA-H. We will explore combining our method with MMA-IL and MMA-H in the future.
The results on WMT’15 En→De, whose training corpus is larger, are shown in Figure 4. Due to resource limitations, we only compare our method with the wait-k and wait-k* baselines. Our method consistently outperforms both, which demonstrates that it also improves performance on large datasets.
(I) Strategy analysis: We visualize the learned strategies for En→Zh wait-3 and wait-9 translation in Figure 5, showing the frequency of each wait-k′ action output by the controller at several selected episodes.
We observe that the controller samples the different k′ roughly uniformly at first, and then the strategies converge within a number of episodes. After convergence, the controller mainly samples a few specific actions, and the two controllers share their most preferred action. Generally, these learned strategies assign most of the sampling frequency to large k′, which again shows the importance of using future information. However, it is worth noting that the controller also samples smaller k′, which means past information is utilized as well; for example, the controller for wait-3 still samples small k′ with non-negligible probability. Our conjecture is that using past information helps mitigate the mismatch between training and testing: if the model were always trained with future information, this mismatch would be large.
(II) Action space selection: In the previous experiments, both future and past information are leveraged; that is, given a test strategy wait-k, the controller can sample a strategy wait-k′ with either k′ ≤ k or k′ > k. To study the effect of using past or future information only, for any wait-k we build two restricted action spaces for k′: {1, …, k} (past information only) and {k, …, K} (future information only). We evaluate wait-k on IWSLT’14 En→De with these two action spaces. The results are reported in Table 2.
We observe that our method with the full action space significantly outperforms the variant using past information only, and slightly outperforms the variant using future information only. This shows that leveraging both kinds of information helps improve performance.
Table 3: Ablation study for feature selection on the IWSLT’14 En→De dataset.
(III) Feature selection: To verify the importance of the features selected in Section 5.1, we conduct four groups of ablation studies, where in each group some specific features are excluded: (i) source and target sentence lengths; (ii) current training loss and average historical training loss; (iii) current validation loss and average historical validation loss; (iv) training step. We work on the IWSLT’14 En→De task and study the effect on test-time wait-k.
The results are shown in Table 3. We report only BLEU scores, since the latency metrics (AP and AL) are not significantly influenced. Removing any feature group causes a performance drop, indicating that all of them contribute to decision making. In particular, information about the network state (feature groups (iii) and (iv)) is more important to the decision making than information about the input data (feature groups (i) and (ii)).
|- (i)|23.67 (-1.00%, rank 3)|26.03 (-0.91%, rank 3)|26.92 (-0.19%, rank 4)|
|- (ii)|23.70 (-0.88%, rank 4)|26.04 (-0.88%, rank 4)|26.91 (-0.22%, rank 3)|
|- (iii)|23.57 (-1.42%, rank 1)|25.92 (-1.33%, rank 2)|26.72 (-0.93%, rank 1)|
|- (iv)|23.65 (-1.09%, rank 2)|25.63 (-2.44%, rank 1)|26.86 (-0.41%, rank 2)|
6 Conclusion and future work
In this work, we propose a new approach for simultaneous NMT. Motivated by the fact that wait-k benefits from future information, we introduce a controller that adaptively assigns a task wait-k′ to each input. A bi-level optimization method is leveraged to jointly learn the translation model and the controller. Experiments on four translation tasks demonstrate the effectiveness of our approach.
There are many interesting directions for future work. First, we will enhance the objective function in Eqn.(3) beyond translation quality only and explicitly introduce a latency constraint. Second, we will combine our method with adaptive decoding methods arivazhagan-etal-2019-monotonic ; ma2020monotonic . Third, we will apply the idea in this work to more applications such as action prediction, weather forecasting, and game AI.
Appendix A Mathematical definitions
A.1 Latency metric definitions
Given the input sentence x and the output sentence y, let |x| and |y| denote the lengths of x and y respectively. Define g(t), a function of the decoding step t, as the number of source tokens processed by the encoder when deciding the target token y_t. For the wait-k strategy, g(t) = min(t + k − 1, |x|). The definitions of Average Proportion (AP) and Average Lagging (AL) are listed in Eqn.(4) and Eqn.(5).
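For reference, these are the standard definitions following ma2019stacl ; arivazhagan-etal-2019-monotonic (with g(t) the schedule defined above):

```latex
\mathrm{AP} = \frac{1}{|x|\,|y|} \sum_{t=1}^{|y|} g(t) \qquad (4)
\mathrm{AL} = \frac{1}{\tau} \sum_{t=1}^{\tau} \left( g(t) - \frac{t-1}{|y|/|x|} \right),
\quad \text{where } \tau = \min\{\, t \mid g(t) = |x| \,\} \qquad (5)
```

Intuitively, AP is the average fraction of the source consumed per target token, while AL measures, in source words, how far the decoder lags behind an ideal simultaneous translator.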
A.2 Mathematical formulation of curriculum learning
In the curriculum learning (briefly, CL) baseline, we gradually decrease k′ from K to the threshold k that will be used in the test setting. There are several ways to decrease k′, including ladder-like, conic, and logarithmic schedules.
In these schedules, T denotes the total number of updates, t denotes the current update number (0 ≤ t ≤ T), and λ is a predefined hyperparameter controlling the shape of the k′–t curve. The k′–t curves are shown in Figure 6. We use the ladder-like CL schedule in our experiments.
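Illustrative forms of the three schedule shapes are sketched below (these are our own formulations of "ladder-like", "conic" and "logarithmic" decay; the paper's exact formulas may differ, and all parameter names here are ours):

```python
import math

def ladder_schedule(t, total, K, k, num_steps=4):
    """Ladder-like decay of k' from K to k: hold each value for an equal
    fraction of training, dropping in discrete steps."""
    frac = min(t / total, 1.0 - 1e-9)
    stage = int(frac * num_steps)
    kp = K - (K - k) * stage / (num_steps - 1)
    return max(k, round(kp))

def conic_schedule(t, total, K, k, lam=2.0):
    """Conic (polynomial) decay controlled by shape parameter lam."""
    return k + (K - k) * (1 - t / total) ** lam

def log_schedule(t, total, K, k):
    """Logarithmic decay: fast drop early, slow approach to k."""
    return K - (K - k) * math.log(1 + t) / math.log(1 + total)
```

All three start at k′ = K at the beginning of training and reach k′ = k at the end; they differ only in how quickly the threshold is lowered.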
Appendix B Detailed introduction of the datasets
For IWSLT’14 En→De, following edunov-etal-2018-classical , we lowercase all words, tokenize them, and apply BPE jointly to the source and target sequences. We hold out part of the training corpus for validation and use the remaining sequences as the training set. The test set is the concatenation of tst2010, tst2011, tst2012, dev2010 and dev2012.
For IWSLT’15 En→Vi, following ma2020monotonic , we tokenize the data and replace words with frequency less than 5 by <unk>. (The data is downloaded from https://nlp.stanford.edu/projects/nmt/ and has already been tokenized.) We use tst2012 as the validation set and tst2013 as the test set.
For IWSLT’17 En→Zh, we tokenize the data and apply BPE independently to the source and target sequences. (The Chinese sentences are tokenized using Jieba: https://github.com/fxsjy/jieba .) We use the concatenation of tst2013, tst2014 and tst2015 as the validation set and tst2017 as the test set. For WMT’15 En→De, we follow the setting in ma2019stacl ; arivazhagan-etal-2019-monotonic : we tokenize the data and apply BPE jointly to the source and target sentences. We use newstest2013 as the validation set and newstest2015 as the test set.
Appendix C Supplemental results
In this section, we report the exact BLEU scores of our method and the baselines. The BLEU scores for the IWSLT tasks are reported in Table 4, and the BLEU scores for WMT En→De in Table 5. We also report the BLEU-AL curves of our method and the baselines on IWSLT’15 En→Vi in Figure 7.
Appendix D Additional ablations and analysis
D.1 Model architecture selection
As mentioned in Section 3 of the main content, we adopt unidirectional instead of bidirectional attention on the encoder side. We compare the performance of the wait-k model with the two attention types on the IWSLT’14 En→De dataset; the results are in Figure 8(a) and Figure 8(b). We also compare our results on WMT’15 En→De with the results of bidirectional attention models reported by ma2019stacl , shown in Figure 8(c). On IWSLT’14, we observe that the performance of wait-k with unidirectional attention drops slightly compared to bidirectional attention. On WMT’15 En→De, our implementation of wait-k with unidirectional attention is slightly better than the bidirectional attention results reported in ma2019stacl . Meanwhile, the computational cost of bidirectional attention is much larger than that of unidirectional attention; for example, the inference speed of the unidirectional wait-k model is considerably higher than that of the bidirectional model.
D.2 Case study
To analyze the effect of using future information, we present two translation examples for En→Zh wait-3 translation in Table 6 and Table 7. We observe that all methods tend to anticipate when future information is lacking (Table 6). Wait-3 makes more mistakes (Table 6) and even anticipates wrongly where there is no need to anticipate (Table 7), while wait-k* and Ours anticipate more appropriately (Table 6). However, as shown in Table 7, wait-k* sometimes generates repeated information, thereby increasing the overall latency. This might result from the gap between training and testing, as wait-k* is trained to produce higher latency. Our method leverages the advantages of both, producing translations with the best quality.
|I||was born||in||a||full of||epilepsy||-||Not a word||-||country||.|
|I was born in a country full of epilepsy Not a word.|
|I||was born||-||when||,||I||suffered from||epilepsy||and||intellectual||disability||.|
|When I was born, I suffered from epilepsy and intellectual disability.|
|When I was born, I was accompanied by epilepsy and intellectual disability.|
At an early step, Wait-3 anticipates "在一个" (in a), while wait-k* and Ours anticipate "的时候" (when) and "时" (when) respectively. The anticipations generated by wait-k* and Ours are more appropriate in context, while wait-3 makes mistakes.
|I opened the website and I found my face.|
|I||opened||-||the website||,||opened||the||website||,||I||POS||face||PROG||PROG||looking at||me||.|
|I opened the website, opened the website, and my face was looking at me.|
|I||opened||-||the website||,||then||-||there was||-||I||POS||face||stare||PROG||me||.|
|I opened the website, and then there was my face staring at me.|
-  Naveen Arivazhagan, Colin Cherry, Wolfgang Macherey, Chung-Cheng Chiu, Semih Yavuz, Ruoming Pang, Wei Li, and Colin Raffel. Monotonic infinite lookback attention for simultaneous machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1313–1323, Florence, Italy, 2019. Association for Computational Linguistics.
-  Yijun Cai, Haoxin Li, Jian-Fang Hu, and Wei-Shi Zheng. Action knowledge transfer for action prediction with partial videos. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8118–8125, 2019.
-  Fahim Dalvi, Nadir Durrani, Hassan Sajjad, and Stephan Vogel. Incremental decoding and training methods for simultaneous translation in neural machine translation. arXiv preprint arXiv:1806.03661, 2018.
-  Sergey Edunov, Myle Ott, Michael Auli, David Grangier, and Marc’Aurelio Ranzato. Classical structured prediction losses for sequence to sequence learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 355–364, New Orleans, Louisiana, June 2018. Association for Computational Linguistics.
-  Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, et al. Achieving human parity on automatic chinese to english news translation. arXiv preprint arXiv:1803.05567, 2018.
-  Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
-  Y. Kong, Z. Tao, and Y. Fu. Deep sequential context networks for action prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3662–3670, 2017.
-  Y. Kong, Z. Tao, and Y. Fu. Adversarial action prediction networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(3):539–553, 2020.
-  Junjie Li, Sotetsu Koyamada, Qiwei Ye, Guoqing Liu, Chao Wang, Ruihan Yang, Li Zhao, Tao Qin, Tie-Yan Liu, and Hsiao-Wuen Hon. Suphx: Mastering mahjong with deep reinforcement learning. arXiv preprint arXiv:2003.13590, 2020.
-  Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng, Kaibo Liu, Baigong Zheng, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, Hua Wu, and Haifeng Wang. STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3025–3036, Florence, Italy, July 2019. Association for Computational Linguistics.
-  Xutai Ma, Juan Pino, James Cross, Liezl Puzon, and Jiatao Gu. Monotonic multihead attention. In 8th International Conference on Learning Representations, 2020.
-  Yi Ren, Jinglin Liu, Xu Tan, Chen Zhang, Tao Qin, Zhou Zhao, and Tie-Yan Liu. Simulspeech: End-to-end simultaneous speech to text translation. In ACL, 2020.
-  Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany, August 2016. Association for Computational Linguistics.
-  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
-  Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
-  Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
-  Hao Xiong, Ruiqing Zhang, Chuanqiang Zhang, Zhongjun He, Hua Wu, and Haifeng Wang. Dutongchuan: Context-aware translation model for simultaneous interpreting. arXiv preprint arXiv:1907.12984, 2019.
-  Chen Zhang, Xu Tan, Jinglin Liu, Yi Ren, Tao Qin, and Tie-Yan Liu. Simuls2s: End-to-end simultaneous speech to speech translation. Openreview, 2019.
-  Baigong Zheng, Kaibo Liu, Renjie Zheng, Mingbo Ma, Hairong Liu, and Liang Huang. Simultaneous translation policies: From fixed to adaptive. In ACL, 2020.
-  Baigong Zheng, Renjie Zheng, Mingbo Ma, and Liang Huang. Simpler and faster learning of adaptive policies for simultaneous translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1349–1354, Hong Kong, China, November 2019. Association for Computational Linguistics.
-  Baigong Zheng, Renjie Zheng, Mingbo Ma, and Liang Huang. Simultaneous translation with flexible policy via restricted imitation learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5816–5822, Florence, Italy, July 2019. Association for Computational Linguistics.