Simultaneous translation has the potential to automate simultaneous interpretation. Unlike a consecutive interpreter, who waits until the speaker pauses (usually at sentence boundaries) before starting to translate and thus doubles the time needed, a simultaneous interpreter translates concurrently with the speaker’s speech, with a delay of only a few (3) seconds. This additive overhead is much more desirable than the multiplicative overhead of 2 in consecutive interpretation.
With this appealing property, simultaneous interpretation has been gradually replacing consecutive interpretation ever since the famous Nuremberg Trials (The Nuremberg Trials, 1946; Palazchenko, 1997), and is now widely used in many scenarios, including multilateral organizations (such as the UN and EU), international summits (such as APEC and G-20), and bilateral or multilateral negotiations. However, because comprehension (in the source language) and production (in the target language) happen concurrently, it is an extremely challenging and exhausting task for humans: there are reportedly only a few thousand qualified simultaneous interpreters worldwide, each can last only about 20-30 minutes in one turn, and error rates grow exponentially after just minutes of interpreting (Moser-Mercer et al., 1998). Moreover, limited memory forces human interpreters to routinely omit source content (He et al., 2016), and even the best of them can retain only about 60% of the source material. Therefore, there is a huge demand for simultaneous interpretation but not nearly enough supply, leaving a critical need for simultaneous machine translation techniques that reduce the burden on human interpreters (United Nations, 1957) and make this service more accessible and affordable.
Unfortunately, simultaneous translation is also notoriously difficult for machines, due in large part to the diverging word order between the source and target languages. For example, consider simultaneously translating an SOV or underlyingly SOV language such as Japanese or German into an SVO language such as English or Chinese: the system has to wait until it sees the source-language verb. As a result, existing commercial “real-time” translation systems resort to conventional full-sentence translation, causing an undesirable latency of at least one sentence. Researchers, on the other hand, have attempted to reduce latency by explicitly predicting the sentence-final German verb (Grissom II et al., 2014), which is limited to this particular case, or by predicting unseen syntactic constituents (Oda et al., 2015), which requires incremental parsing of the source sentence. Others use reinforcement learning to prefer (rather than enforce) a specific latency (Gu et al., 2017), which as a result is rarely met at test time. All these efforts have two major limitations: (a) none of them can achieve an arbitrary given latency such as “3-word delay”; and (b) their systems are overcomplicated, involving many components (such as prediction), and are slow to train.
We instead propose a very simple but surprisingly effective idea for this problem and make the following contributions:
- We propose a simple “wait-$k$” model trained to generate the target sentence concurrently with the source sentence, but always $k$ words behind, for any given $k$. This strategy can therefore satisfy any latency requirement.
- Unlike previous work with explicit prediction of source words, our model directly predicts target words, and seamlessly integrates anticipation and translation in a single model.
- We also propose a new latency metric called “Average Lagging”, which addresses deficiencies in previous metrics.
- Experiments show that our strategy achieves low latency and reasonable BLEU scores (compared to full-sentence translation baselines) on both Chinese-to-English and English-to-German simultaneous translation.
2 Background: Conventional Neural MT
We first briefly review standard (full-sentence) neural translation to set up the notation.
Regardless of the particular design of different sequence-to-sequence models, the source-language input sentence is first encoded, and the encoded representations are passed to the decoder for generation in the target language. Generally speaking, the encoder takes an input sequence $\mathbf{x} = (x_1, \ldots, x_n)$ of $n$ elements, where each $x_i \in \mathbb{R}^{d_x}$ is a word embedding of $d_x$ dimensions, and produces a new sequence of hidden states $\mathbf{h} = f(\mathbf{x}) = (h_1, \ldots, h_n)$. The encoder function $f$ can be implemented by an RNN or a Transformer.
On the other hand, a (greedy) decoder predicts the next output word $y_t$ given the source sequence (actually its representation $\mathbf{h}$) and the previously generated words $y_1, \ldots, y_{t-1}$. The decoder continues generating words until it emits </eos>. When decoding finishes, the generated hypothesis is $\mathbf{y} = (y_1, \ldots, y_{|\mathbf{y}|})$, where $y_{|\mathbf{y}|}$ is </eos>, with probability

$$p(\mathbf{y} \mid \mathbf{x}) = \prod_{t=1}^{|\mathbf{y}|} p(y_t \mid \mathbf{x}, \mathbf{y}_{<t}) \qquad (1)$$

where $\mathbf{y}_{<t}$ denotes the prefix $(y_1, \ldots, y_{t-1})$. At training time, we maximize the conditional probability of each ground-truth target sentence $\mathbf{y}^\star$ given input $\mathbf{x}$ over the whole training data $D$, or equivalently minimize the following loss:

$$\ell(D) = -\sum_{(\mathbf{x}, \mathbf{y}^\star) \in D} \log p(\mathbf{y}^\star \mid \mathbf{x}) \qquad (2)$$

In conventional neural machine translation, each $y_t$ in Eq. 1 is predicted based on the entire source-side context $\mathbf{x}$. However, in simultaneous translation, where we need immediate translation outputs before the entire source sentence finishes, we need a different way of generating predictions.
3 “Wait-$k$” Model: Prediction based on Input Prefix
For a simple overview of our proposed “wait-$k$” process, we only need to change the decoder so that it predicts the next output word $y_t$ based on the input prefix $\mathbf{x}_{\le t+k-1}$ and the output prefix $\mathbf{y}_{<t}$. We use $\mathbf{x}_{\le t+k-1}$ to represent the processed source sequence and $k$ to represent the word-level latency, which means that the target side is always $k$ word(s) behind the source side. Then we have the following wait-$k$ model:

$$p_{\text{wait-}k}(\mathbf{y} \mid \mathbf{x}) = \prod_{t=1}^{|\mathbf{y}|} p(y_t \mid \mathbf{x}_{\le t+k-1}, \mathbf{y}_{<t}) \qquad (3)$$

To obtain a more general definition that applies to an arbitrary policy, we replace $t+k-1$ above with $g(t)$, which describes the number of source words that have been processed by the encoder at decoding time step $t$:

$$g(t) = \min\{k + t - 1, |\mathbf{x}|\} \qquad (4)$$

Our new decoding and training objective then becomes:

$$p_g(\mathbf{y} \mid \mathbf{x}) = \prod_{t=1}^{|\mathbf{y}|} p(y_t \mid \mathbf{x}_{\le g(t)}, \mathbf{y}_{<t}) \qquad (5)$$

where $\mathbf{x}_{\le g(t)}$ denotes the prefix $(x_1, \ldots, x_{g(t)})$.
Eq. 4 describes the number of source words observed by the decoder. When our model decodes the first word, $k$ words have been observed by the encoder. From the second decoding step on, every later step observes one more source word. After time step $|\mathbf{x}| - k$, the encoder has received the entire source sentence and the number of observed words stops increasing.
The decoding strategy in Eq. 5 differs from Eq. 1 and defines an encoding-decoding policy that first waits for $k$ words on the source side and then starts decoding the first target word. Thenceforth the decoder generates a new target word every time another source word is fed into the encoder.
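The wait-$k$ schedule is easy to make concrete. The following minimal sketch assumes the policy described above, i.e. that the decoder sees $\min(k + t - 1, |\mathbf{x}|)$ source words when predicting the $t$-th target word. The function names `wait_k_g` and `wait_k_actions` are illustrative only and are not taken from any released implementation.

```python
def wait_k_g(t, k, src_len):
    """Source words visible when predicting target word t (1-indexed)."""
    return min(k + t - 1, src_len)


def wait_k_actions(k, src_len, tgt_len):
    """READ/WRITE sequence ('R'/'W') induced by the wait-k schedule."""
    actions = []
    words_read = 0
    for t in range(1, tgt_len + 1):
        # READ until the source prefix required for step t is available.
        while words_read < wait_k_g(t, k, src_len):
            actions.append("R")
            words_read += 1
        actions.append("W")  # emit target word t
    return actions


if __name__ == "__main__":
    print([wait_k_g(t, 3, 5) for t in range(1, 6)])
    print("".join(wait_k_actions(3, 5, 5)))
```

With $k = 3$ and five words on each side, the schedule is `RRRWRWRWWW`: three initial reads, then alternating reads and writes, and finally the last target words written with no further reads.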
Below we detail two different instantiations of this simple framework, with RNN and Transformer being the underlying models, respectively.
3.1 Wait-$k$ with RNNs
We first introduce our uni-directional RNN-based approach as a baseline, to set up a fair comparison with other methods.
3.1.1 Full Sentence Translation with RNNs
For the basic RNN-based framework, on the encoder side, the RNN maps a sequence $\mathbf{x}$ into a sequence of hidden states $\mathbf{h}$ as follows:

$$h_i = \text{RNN}(x_i, h_{i-1}; \theta_e) \qquad (6)$$

We then collect the list of hidden states $\mathbf{h}$ to represent the source side. Since we use a uni-directional RNN, each newly arriving word can be appended to the encoder in constant time.
On the other side, the decoder uses another RNN to generate the target-side hidden representation $s_t$ at step $t$ as follows:

$$s_t = \text{RNN}(s_{t-1}, \mathbf{h}; \theta_d) \qquad (7)$$

The encoder and decoder are parameterized by $\theta_e$ and $\theta_d$, respectively. Note that the decoder takes the entire source context $\mathbf{h}$ as input at each generation step.
3.1.2 Simultaneous Translation with RNNs
Different from conventional translation, in simultaneous translation the source words are fed into the system one by one, and decoding starts after a delay of $k$ words. As mentioned above, during encoding we can simply append each newly arriving word to the end of the existing encoder states.
On the decoder side, we need to change Eq. 7 into the following definition:

$$s_t = \text{RNN}(s_{t-1}, \mathbf{h}_{\le g(t)}; \theta_d) \qquad (8)$$

where $s_t$ is the decoder hidden state at step $t$, $\theta_d$ denotes the decoder parameters, and $\mathbf{h}_{\le g(t)}$ indicates that the decoder can only observe the first $g(t)$ hidden states from the encoder. The decoder waits for $k$ words on the source side, and then generates a new word each time another word is fed into the encoder.
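The interleaving of reading and writing can be sketched as a decoding loop. This is a minimal illustration, not the system's implementation: `predict` is a hypothetical stand-in for the decoder, `toy_copy_model` is a toy placeholder for a trained model, and the loop assumes the wait-$k$ rule that $k + t - 1$ source words must be available (or the source must have ended) before the $t$-th target word is written.

```python
EOS = "</eos>"


def wait_k_decode(source_stream, k, predict, max_len=100):
    """Greedy wait-k decoding over a stream of source words.

    `predict(src_prefix, tgt_prefix, src_finished)` is a hypothetical
    stand-in for the decoder: it returns the next target word.
    """
    stream = iter(source_stream)
    src_prefix, tgt_prefix = [], []
    src_finished = False
    while len(tgt_prefix) < max_len:
        # Before writing target word t = len(tgt_prefix) + 1,
        # READ until k + t - 1 source words are available.
        needed = k + len(tgt_prefix)
        while not src_finished and len(src_prefix) < needed:
            try:
                src_prefix.append(next(stream))  # constant-time append
            except StopIteration:
                src_finished = True
        word = predict(src_prefix, tgt_prefix, src_finished)
        tgt_prefix.append(word)
        if word == EOS:
            break
    return tgt_prefix


def toy_copy_model(src_prefix, tgt_prefix, src_finished):
    """Toy monotone 'translator': upper-cases the aligned source word."""
    position = len(tgt_prefix)
    if position < len(src_prefix):
        return src_prefix[position].upper()
    return EOS


if __name__ == "__main__":
    print(wait_k_decode(["a", "b", "c", "d"], 2, toy_copy_model))
```

Passing the visible prefix explicitly to `predict` keeps the sketch independent of the underlying model, so the same loop would apply to an RNN or a Transformer decoder.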
Besides the RNN-based model, we also design a Transformer-based framework to improve our system’s performance even further.
3.2 Wait-$k$ with Transformer
Given the remarkable performance of the Transformer (Vaswani et al., 2017), we adapt it into a simultaneous machine translation framework with arbitrary latency constraints. We first briefly review the Transformer architecture from a step-by-step point of view to highlight the differences between the conventional Transformer and our proposed one.
3.2.1 Full Sentence Transformer Model
The Transformer encoder works in a self-attention fashion: it takes an input sequence $\mathbf{x}$ and produces a new sequence of hidden states $\mathbf{z} = (z_1, \ldots, z_n)$, where $z_i \in \mathbb{R}^{d_z}$ is computed as follows:

$$z_i = \sum_{j=1}^{n} \alpha_{ij}\, P_{W_V}(x_j) \qquad (9)$$

where $P_{W_V}(\cdot)$ is a projection function from the input space to the value space with parameters $W_V$, and $\alpha_{ij}$ denotes the attention weights, which are computed with a softmax function:

$$\alpha_{ij} = \frac{\exp e_{ij}}{\sum_{l=1}^{n} \exp e_{il}} \qquad (10)$$

The above $e_{ij}$ measures the similarity between two elements:

$$e_{ij} = \frac{P_{W_Q}(x_i)\, P_{W_K}(x_j)^{\top}}{\sqrt{d_x}} \qquad (11)$$

where $P_{W_Q}(x_i)$ and $P_{W_K}(x_j)$ project $x_i$ and $x_j$ to the query and key spaces, respectively, with parameters $W_Q$ and $W_K$. When more self-attention layers are needed, these projection parameters are unique to each layer and attention head. We use 6 layers of self-attention in our model and use $\mathbf{h}$ to denote the top-layer output sequence (i.e., the source context).
On the decoder side, during training, the gold output sequence $\mathbf{y}^\star$ first goes through the same self-attention to generate the hidden self-attended state sequence $\mathbf{c} = (c_1, \ldots, c_{|\mathbf{y}^\star|})$. Note that on the decoder side, we let $\alpha_{ij} = 0$ if $j > i$ in Eq. 10 to restrict self-attention to previously generated words.
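In each layer, after gathering the hidden representations for each target word through self-attention, we then perform target-to-source attention as follows:

$$c'_i = \sum_{j=1}^{n} \beta_{ij}\, P_{W_{V'}}(h_j)$$

where $c_i$ is the self-attended state of the $i$-th target word, $h_j$ is the source context of the $j$-th source word, and the weights $\beta_{ij}$ are computed in the same way as in Eq. 10 and Eq. 11, with $c_i$ as the query and $h_j$ as the key.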
3.2.2 Simultaneous Transformer Model
As mentioned above, simultaneous translation needs to start generating translations before the input sentence finishes. This requires feeding the source sentence incrementally to the encoder, with the decoder predicting a new target word each time the encoder receives a new source word. Training such an incremental encoder and decoder naively is inefficient. To train our framework efficiently under these requirements, we make several modifications on both the encoder and decoder sides.
On the encoder side, during training, we still feed the entire sentence to the encoder at once. But unlike the self-attention layer defined in Eq. 9 and Eq. 10 of the conventional Transformer, we constrain each word to attend only to a prefix of the sentence as follows:

$$e_{ij}^{(t)} = \begin{cases} e_{ij} & \text{if } i, j \le t + k - 1 \\ -\infty & \text{otherwise} \end{cases}$$

that is, $e_{ij}^{(t)}$ takes the similarity value of Eq. 11 when both $i$ and $j$ are within $t + k - 1$, where $k$ is the latency constraint, e.g., wait $k$ words, and is $-\infty$ otherwise. The translation time step is represented by $t$. For example, when we are at translation step $t$ with $k$-word latency, there are $t + k - 1$ words in the encoder and $t - 1$ words have been translated by the decoder. We then redefine Eq. 10 as follows:

$$\alpha_{ij}^{(t)} = \begin{cases} \dfrac{\exp e_{ij}^{(t)}}{\sum_{l=1}^{t+k-1} \exp e_{il}^{(t)}} & \text{if } i, j \le t + k - 1 \\ 0 & \text{otherwise} \end{cases}$$

The above self-attention layer only allows each word to attend within the visible prefix, which generates representations equivalent to the incremental scenario, since every word is blind to later words. When we stack more prefix self-attention blocks as in the conventional Transformer, earlier words remain unaffected by information from future words. In this way, we simulate the incremental setting even though the full source sentence is observable during training.
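The prefix restriction amounts to a masked softmax. The sketch below is a toy illustration in plain Python rather than the actual batched tensor implementation: `scores` is an assumed matrix of unnormalized similarities, `visible` is the number of source words observed so far, and the function name is hypothetical.

```python
import math


def prefix_attention_weights(scores, visible):
    """Softmax over `scores` restricted to the first `visible` positions.

    scores[i][j] is an unnormalized similarity between words i and j
    (0-indexed). Pairs outside the visible prefix get weight 0, which is
    what a score of -infinity would produce under softmax.
    """
    n = len(scores)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(min(visible, n)):
        row = scores[i][:visible]
        peak = max(row)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in row]
        total = sum(exps)
        for j, e in enumerate(exps):
            weights[i][j] = e / total
    return weights


if __name__ == "__main__":
    uniform_scores = [[0.0] * 4 for _ in range(4)]
    for row in prefix_attention_weights(uniform_scores, visible=2):
        print(row)
```

Rows for words inside the prefix sum to one over the visible positions only, and rows for words that have not yet arrived stay at zero, so no information from future words can leak into the visible representations.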
However, there is still another problem in how the encoder generates its context representations.
3.2.3 Context Caching
The way the Transformer generates context representations differs from the RNN-based framework, especially uni-directional RNN-based models. In a uni-directional RNN, each word’s hidden representation depends only on the previous state, so when a new word is appended to the source sentence, the representations of all prefix words remain unchanged. This is not the case in the Transformer because of the self-attention mechanism.
As we observe from the definitions of the prefix-restricted similarity and attention weights above, when another word is fed into the encoder, the translation time step increments from $t$ to $t+1$. Every word prior to the new word on the source side needs to adjust its existing representation to a new one in order to absorb the new word’s information. Therefore, unlike the conventional Transformer, our framework keeps a list of incremental source-side contexts instead of the single tensor $\mathbf{h}$ of the previous section to represent the entire source side. Our source-side context is defined as follows:

$$\mathbf{H} = (\mathbf{h}^{(1)}, \mathbf{h}^{(2)}, \ldots, \mathbf{h}^{(n)})$$

where $n$ is the total length of the source sentence, and $\mathbf{h}^{(i)}$ represents the context representation of the first $i$ words. Note that $\mathbf{h}^{(n)}$ equals $\mathbf{h}$ in the previous section.
On the other hand, for the source-target attention on the decoder side, at translation step $t$ the decoder can only observe the context $\mathbf{h}^{(g(t))}$ in $\mathbf{H}$.
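The need for one context per prefix can be illustrated with a toy encoder. In the sketch below, `encode_prefix` is a deliberately simplified stand-in (one layer of uniform attention, so each position becomes the mean of the visible embeddings). It is an assumption made purely for illustration, as are the function names and the wait-$k$ lookup $\min(k + t - 1, n)$.

```python
def encode_prefix(embeddings):
    """Toy one-layer 'self-attention' with uniform weights.

    Every position is represented by the mean of all visible embeddings,
    so earlier positions change whenever a new word becomes visible.
    """
    n = len(embeddings)
    dim = len(embeddings[0])
    mean = [sum(vec[d] for vec in embeddings) / n for d in range(dim)]
    return [list(mean) for _ in range(n)]


def build_context_cache(embeddings):
    """cache[i - 1] is the context computed from the first i source words."""
    return [encode_prefix(embeddings[:i]) for i in range(1, len(embeddings) + 1)]


def visible_context(cache, t, k):
    """Context the decoder may attend to at target step t under wait-k."""
    g = min(k + t - 1, len(cache))
    return cache[g - 1]


if __name__ == "__main__":
    cache = build_context_cache([[1.0], [3.0], [5.0]])
    for context in cache:
        print(context)
```

The representation of the first word differs across prefixes in this toy setting, which is why a single cached tensor is not enough and the decoder has to pick the context that matches the number of words read so far.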
4 Refinements: Wait-$k$ with Catchup
In the previous section, the decoder of the proposed “wait-$k$” process is always $k$ words behind the incoming source stream. In the ideal case where the source and target sentences have the same length, the last $k$ words on the target side are generated without waiting for the encoder, since no new word is expected on the source side (for example, a period indicates that the source sentence has finished).
From the user’s perspective, all the previous words are decoded one by one, while the last $k$ words are shown on the screen all at once (this phenomenon can be easily observed in our online demo). This suddenly increases the user’s reading workload at the end of the sentence. Such an increase might be acceptable when $k$ is small. However, when the target side is expected to be much longer than the source side, this becomes unpleasant because many more than $k$ words are thrown at the user to digest at one time.
Based on previous studies of translation length ratio (Huang et al., 2017; Yang et al., 2018), it is well known that the target side can be much longer than the source side for certain translation pairs, e.g., Chinese-to-English. In our Chinese-English development set (NIST 06), the target-side English sentences are on average longer than the Chinese source sentences.
Fig. 2 shows one example of translation from Chinese to English. Note that the English side has more words than the Chinese side. The left figure shows that under the wait-$k$ policy, the translated English sentence is always $k$ words behind the Chinese input. After the initial wait, every English word corresponds to one Chinese input word. However, the last row indicates that five English words are emitted at once during the last step. The sudden appearance of these five English words increases the reader’s workload. Therefore, we propose to “amortize” this sudden workload over the previous translation steps.
For example, assume the translation ratio is which can get from training corpus, we design another read and write action sequence pattern as . This means that the decoder will decode words when there are new coming words. And the extra one word will always be in the third decoding step for every translation time step. In this way, we only expect to generate words instead of words. Note that when sentence gets longer, this different would be more significant.
Our new decoding and training strategy are as follows:
where we use to describes the number of source words that have been processed by encoder in catchup mode:
where we use to represent the index for extra decoding. For example, in Fig. 2, catchup index is . The initialization value for is , and increases by one when . In this way, during training time, for the example in Fig. 2, the second and third decoded words share the same context information from source side. The desired can be learned from training corpus. For example, assume we have a length ratio of , then we know we should have one extra decoding at time step since .
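One way to picture the catchup idea is with a closed-form schedule. The sketch below is only an illustration under an assumed rule: it uses a catch-up frequency `c` (extra target words per source word, e.g. `c = 0.25` for a length ratio of 1.25) and the form $\min(k + t - 1 - \lfloor c\,t \rfloor, |\mathbf{x}|)$, which may differ from the index-based definition used in this section. All function names are hypothetical.

```python
import math


def catchup_g(t, k, c, src_len):
    """Source words visible at target step t under an assumed catch-up rule.

    c is the assumed number of extra target words per source word,
    e.g. c = 0.25 for a target-to-source length ratio of 1.25.
    """
    return min(k + t - 1 - math.floor(c * t), src_len)


def policy_actions(g, tgt_len):
    """READ/WRITE string ('R'/'W') induced by a policy g(t)."""
    actions, words_read = [], 0
    for t in range(1, tgt_len + 1):
        while words_read < g(t):
            actions.append("R")
            words_read += 1
        actions.append("W")
    return "".join(actions)


if __name__ == "__main__":
    policy = lambda t: catchup_g(t, 3, 0.25, 8)
    print([policy(t) for t in range(1, 9)])
    print(policy_actions(policy, 10))
```

Under this assumed rule, every fourth target word is written without a preceding read, so two adjacent target words share the same source context and fewer words remain to be emitted in a burst at the end of the sentence.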
5 A New Latency Metric
Besides accuracy, latency is another crucial measure, reflecting how much time is spent waiting for the translation. This section first introduces the existing latency metrics and shows their problems, and then proposes our newly defined latency metric.
5.1 Existing Metrics: CW and AP
Consecutive Wait (CW) measures the number of consecutive waits before translating each target word, i.e., the length of silence between two adjacent translated words. For each READ or WRITE action, CW is defined as follows:

$$C_t = \begin{cases} C_{t-1} + 1 & \text{if } a_t = \text{READ} \\ 0 & \text{if } a_t = \text{WRITE} \end{cases}$$

where $a_t$ is the action at time step $t$ and $C_t$ is the number of waited words between two adjacent translated words.
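As a quick illustration, CW can be computed directly from an action sequence. The sketch assumes the usual recurrence in which the counter increases by one on each READ and resets to zero on each WRITE. The encoding of actions as the characters `R` and `W` and both function names are choices made here for illustration.

```python
def consecutive_waits(actions):
    """Running CW counter after each action ('R' = READ, 'W' = WRITE)."""
    counters = []
    current = 0
    for action in actions:
        current = current + 1 if action == "R" else 0
        counters.append(current)
    return counters


def waits_before_each_write(actions):
    """Number of consecutive READs immediately preceding each WRITE."""
    waits, run = [], 0
    for action in actions:
        if action == "R":
            run += 1
        else:
            waits.append(run)
            run = 0
    return waits


if __name__ == "__main__":
    print(consecutive_waits("RRWRWW"))
    print(waits_before_each_write("RRWRWW"))
```

For the sequence `RRWRWW`, the silence before the three target words is 2, 1 and 0 source words respectively, which shows how CW captures only the local gap before each word.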
Another latency measure, Average Proportion (AP) (Cho and Esipova, 2016), focuses on global latency and is defined as follows:

$$\text{AP}_g(\mathbf{x}, \mathbf{y}) = \frac{1}{|\mathbf{x}|\,|\mathbf{y}|} \sum_{t=1}^{|\mathbf{y}|} g(t) \qquad (14)$$

As in Eq. 4, $g(t)$ in the above equation is the number of source words that have been waited for when decoding the $t$-th word on the target side.
We observe from the above definitions that CW only captures local latency and can hardly describe global delay.
AP is designed to capture global delay, but it has an obvious flaw. Consider a translation policy that decodes one target word each time a new source word arrives. More explicitly, the action sequence is “R W R W R W R W…” or “(R W)$^*$”, where the number of R’s equals the number of source words and the number of W’s equals the number of target words. Assume we have two source words that are translated into another two words in the target language. Following Eq. 14, we have a latency of $(1 + 2)/(2 \times 2) = 0.75$. However, when the lengths of both the source and target sides grow to infinity, the same policy has a different latency, with AP approaching $0.5$. Furthermore, since AP is a proportion, the actual delay in number of words is still not obvious to the user.
Therefore, AP is very sensitive to the actual lengths on both sides, and it can only make a fair comparison between two policies that operate on the same source and target lengths. In order to have a global latency definition that does not depend on the source and target lengths and is easier for users to understand, we propose another latency measure, Average Lagging (AL).
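The length sensitivity of AP is easy to check numerically. The sketch below assumes AP is the sum of $g(t)$ over target steps divided by $|\mathbf{x}|\,|\mathbf{y}|$, and models the “R W R W…” policy as `wait_1`. Both function names are illustrative.

```python
def average_proportion(g, src_len, tgt_len):
    """AP: total source words waited for, normalized by |x| * |y|."""
    total_waited = sum(g(t) for t in range(1, tgt_len + 1))
    return total_waited / (src_len * tgt_len)


def wait_1(src_len):
    """Policy for the 'R W R W ...' action sequence: g(t) = min(t, |x|)."""
    return lambda t: min(t, src_len)


if __name__ == "__main__":
    for n in (2, 10, 1000):
        print(n, average_proportion(wait_1(n), n, n))
```

For equal lengths $n$ this policy gives $(n + 1) / (2n)$, so the same behaviour is scored 0.75 on a two-word sentence and close to 0.5 on a very long one.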
5.2 New Metric: Average Lagging
Fig. 3 shows the basic intuition behind our proposed metric. The left figure shows the special case $|\mathbf{x}| = |\mathbf{y}|$ for simple demonstration. The thick black line indicates the policy in which the decoder is always one word ahead of the encoder, and we define this policy to have an AL of $0$. The yellow squares on the diagonal represent the “wait-1” policy, which translates each new word starting from the first source word. In this case, our AL is $1$, which measures the average of the yellow areas below the thick black line across different steps. For the “wait-4” policy, which translates each new word starting from the third source word, we use all the red area and the first yellow square (starting from the left) to represent its lagging.
In the cases on the right side of Fig. 3, where $|\mathbf{x}| < |\mathbf{y}|$, we notice that more and more delay accumulates as the target sentence grows. For example, for the yellow “wait-1” policy, the delay at later decoding steps exceeds the one-word delay of the left case. This difference is mainly caused by the translation ratio: in the right example, more than one target word is generated for each source word.
Based on the above observation, we need to define a more general latency measure that takes the length ratio into consideration. More formally, we define AL as follows:

$$\text{AL}_g(\mathbf{x}, \mathbf{y}) = \frac{1}{\tau_g(|\mathbf{x}|)} \sum_{t=1}^{\tau_g(|\mathbf{x}|)} \left( g(t) - \frac{t - 1}{r} \right) \qquad (15)$$

where we use $\tau_g(|\mathbf{x}|) = \min\{t \mid g(t) = |\mathbf{x}|\}$ to find the earliest decoding step at which the encoder has observed the full source sentence, and $r = |\mathbf{y}| / |\mathbf{x}|$ to represent the target-to-source length ratio. For the right example in Fig. 3, the yellow policy and the red policy reach this point at different decoding steps.
Eq. 15 describes the average number of words by which the decoder lags. Note that we sum the lags only up to $\tau_g(|\mathbf{x}|)$, since for $t > \tau_g(|\mathbf{x}|)$ all remaining words can be decoded without extra waiting once the entire source sentence has been observed. The AL of the “wait-1” policy in the right figure is greater than that in the left figure, which shows that AL is sensitive to the actual length ratio.
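A small numerical check helps make AL concrete. The sketch below assumes AL averages $g(t) - (t - 1)/r$ over the decoding steps up to the first step at which the whole source has been read, with $r = |\mathbf{y}| / |\mathbf{x}|$. The helper `wait_k` and the fallback of using the last target step when the source is never fully read are assumptions made for this illustration.

```python
def average_lagging(g, src_len, tgt_len):
    """AL: average lag, in source words, behind an ideal policy."""
    r = tgt_len / src_len  # target-to-source length ratio
    tau = tgt_len  # fallback if the full source is never observed
    for t in range(1, tgt_len + 1):
        if g(t) >= src_len:
            tau = t  # earliest step that sees the whole source
            break
    lags = [g(t) - (t - 1) / r for t in range(1, tau + 1)]
    return sum(lags) / tau


def wait_k(k, src_len):
    """wait-k policy: g(t) = min(k + t - 1, |x|)."""
    return lambda t: min(k + t - 1, src_len)


if __name__ == "__main__":
    print(average_lagging(wait_k(1, 10), 10, 10))
    print(average_lagging(wait_k(4, 10), 10, 10))
    print(average_lagging(wait_k(1, 5), 5, 10))
```

With equal lengths, wait-$k$ has an AL of exactly $k$ under this definition, while for a target twice as long as the source the wait-1 policy lags by 2 on average, so the metric reads directly in words and reflects the length ratio.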
6 Experiments
This section first showcases the accuracy and latency of our proposed “wait-$k$” model. Then we demonstrate that our catchup model reduces latency even further at a small cost in accuracy.
The performance of our models is demonstrated on both English-to-German and Chinese-to-English translation tasks. We use the parallel corpora available from WMT15 (http://www.statmt.org/wmt15/translation-task.html) for English-to-German translation (4.5M sentence pairs) and the NIST corpus for Chinese-to-English translation (2M sentence pairs). We first apply BPE (Sennrich et al., 2015) on both sides to reduce the source and target vocabularies. We then exclude sentence pairs longer than 50 and 256 words for English-to-German and Chinese-to-English, respectively. For English-to-German, our development and test sets have sizes of 3,003 and 2,169, respectively. Our implementation is adapted from the PyTorch-based OpenNMT (Klein et al., 2017). For Chinese-to-English, we use NIST 06 (616 sentence pairs) and NIST 08 (691 sentence pairs) as our development and test sets. In the catchup experiments, we use a catchup pattern that represents a length ratio of 1.25 from the source side to the target side. This ratio of 1.25 is estimated from the training corpus. For the English-to-German translation task, we do not need catchup since the source-to-target ratio is almost 1.
In the following experiments, we report 1-reference, 3-reference and 4-reference BLEU scores to compare different models. Note that the human reference translations are obtained from non-simultaneous scenarios.
Our Transformer’s parameters are the same as the base model’s parameter settings in the original paper (Vaswani et al., 2017).
6.1 Performance of “Wait-$k$” Model
In Fig. 4, we compare BLEU score and AP with the model of Gu et al. (2017) on the test set for the English-to-German task. The results show that our RNN-based model outperforms the model of Gu et al. (2017), and our Transformer achieves much better performance. In Fig. 5, we also compare BLEU score together with AL between the RNN-based and Transformer-based models.
In Fig. 6 and Fig. 7, we compare the accuracy and different latency measurements for both the “wait-$k$” and “catchup” models on the development set. As shown, the “wait-4” catchup model has latency similar to that of the “wait-1” model. This demonstrates that our catchup model indeed improves latency, especially for long sentences. Our test set comparison is shown in Table 1.
|Transformer wait-k +catchup||AP||0.54||0.60||0.62||0.65||0.68||0.72||0.73||0.76||0.79||0.81||1||1|
|Wait-3||the||olympic||games||will||be||held in beijing in 2008|
|+catchup||the||olympic||games||will be||held||in beijing in 2008|
|Baseline||the olympic games will be held in beijing in 2008|
|Baseline||the olympic games will be held in beijing in 2008|
|Wait-3||because||of||the||tur-||bul-||ence||in||the||middle east , a series of conflicts have occurred|
|+catchup||because||of||the||tur- bul-||ence||situation||in||the middle||east , a series of conflicts broke out|
|Baseline||a series of conflicts broke out in the middle east|
|because of turbulent situation|
|Baseline||a series of conflicts broke out due to turbulence|
|in the middle east region|
|Wait-3||jiang||zemin||expressed||his||appreciation||for bush ’s speech|
|Wait-5||jiang||zemin||expressed||regret over bush ’s speech|
|Baseline||jiang zemin expressed regret over bush ’s speech|
|Baseline||jiang zemin expressed regret over bush ’s speech|
We showcase some real running examples generated by our proposed model and the baseline framework to demonstrate the effectiveness of our system. In all the following tables, the first line is the Chinese input, represented in pinyin for easier reference by non-Chinese speakers. The second line is the “gloss”, i.e., the word-by-word translation. Our “wait-3” system’s output is in the third line of the table, $3$ words behind the input. The output of the baseline method, which starts generating words after the entire source sentence has been encoded, is in the last row.
Table 4 demonstrates that our system can move a prepositional phrase to the end of the sentence depending on the situation. For the case in Table 4, the year information is not yet visible at the first decoding step, which can see only the first three source words. The decoder therefore moves the time information to the end of the sentence. Table 4 shows that our model can also move a reason clause to the beginning of the sentence and a location clause to the middle of the sentence.
Table 4 shows another comparison, between the “wait-3” and “wait-5” models. The “wait-3” model cannot observe any sentiment information at decoding time, since the sentiment word “yíhàn” (meaning regret) only shows up at the end of the sentence. Therefore, the “wait-3” model predicts “appreciation” based on its training experience. The “wait-5” model, which can observe the sentiment information at the relevant decoding step, correctly predicts the word “regret”.
We also analyze the probabilities of these two words further: the “wait-3” model ranks “appreciation” first, while “regret” ranks lower. For the “wait-5” model, after the model has observed the sentiment information “yíhàn” (meaning regret) on the source side, “appreciation” is demoted from first place and “regret” is promoted to first place, with a higher probability than “appreciation” received in the “wait-3” model.
7 Related Work
In a parallel work released after our paper, Press and Smith (2018) propose an “eager translation” model which also outputs target-side words before the whole input sentence is fed in, but there are several crucial differences: (a) their work “is not a simultaneous translation model”; it still aims at translating full sentences using beam search; (b) their work does not anticipate future words; (c) they use word alignments to learn the reordering and achieve it in decoding by emitting the $\epsilon$ token, while our work integrates reordering into a single wait-$k$ prediction model that is agnostic of, yet capable of, reordering.
We have presented a very simple framework of “wait-$k$” predictive policy that can achieve simultaneous translation with arbitrarily low latency while maintaining high translation quality.
We thank Hua Wu, Kenneth Church, Jiaji Huang, Renjie Zheng, and Hao Zhang for discussions and comments, and Colin Cherry for spotting a mistake in the AL definition (Eq. 15).
- Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 .
- Cho and Esipova (2016) Kyunghyun Cho and Masha Esipova. 2016. Can neural machine translation do simultaneous translation? arXiv preprint arXiv:1606.02012. http://arxiv.org/abs/1606.02012.
- Grissom II et al. (2014) Alvin Grissom II, He He, Jordan Boyd-Graber, John Morgan, and Hal Daumé III. 2014. Don’t until the final verb wait: Reinforcement learning for simultaneous machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pages 1342–1352.
- Gu et al. (2017) Jiatao Gu, Graham Neubig, Kyunghyun Cho, and Victor O. K. Li. 2017. Learning to translate in real-time with neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 1: Long Papers. pages 1053–1062. https://aclanthology.info/papers/E17-1099/e17-1099.
- He et al. (2016) He He, Jordan Boyd-Graber, and Hal Daumé III. 2016. Interpretese vs. translationese: The uniqueness of human strategies in simultaneous interpretation. In North American Association for Computational Linguistics.
- Huang et al. (2017) Liang Huang, Kai Zhao, and Mingbo Ma. 2017. When to finish? Optimal beam search for neural text generation (modulo beam size). In EMNLP.
- Klein et al. (2017) G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. M. Rush. 2017. OpenNMT: Open-Source Toolkit for Neural Machine Translation. ArXiv e-prints .
- Moser-Mercer et al. (1998) Barbara Moser-Mercer, Alexander Künzli, and Marina Korac. 1998. Prolonged turns in interpreting: Effects on quality, physiological and psychological stress (pilot study). Interpreting 3(1):47–64.
- Oda et al. (2015) Yusuke Oda, Graham Neubig, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2015. Syntax-based simultaneous translation through prediction of unseen syntactic constituents. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). volume 1, pages 198–207.
- Palazchenko (1997) Pavel Palazchenko. 1997. My Years with Gorbachev and Shevardnadze: The Memoir of a Soviet Interpreter. Penn State University Press.
- Press and Smith (2018) Ofir Press and Noah A. Smith. 2018. You may not need attention.
- Sennrich et al. (2015) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909 .
- The Nuremberg Trials (1946) The Nuremberg Trials. 1946. (film) original footage from the nuremberg trials as well as interviews with chief-interpreters at the time. Geneva: AIIC archives.
- United Nations (1957) United Nations. 1957. Health problems of interpreters. report submitted by the medical director of the united nations, new york, to the secretary general. Geneva: AIIC archives.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30.
- Yang et al. (2018) Yilin Yang, Liang Huang, and Mingbo Ma. 2018. Breaking the beam search curse: A study of (re-) scoring methods and stopping criteria for neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.