Simultaneous translation, which starts translating before the speaker finishes, is extremely useful in many scenarios, such as international conferences and travel. To achieve low latency, it is often inevitable to generate target words with insufficient source information, which makes this task extremely challenging.
Recently, there have been many efforts toward balancing translation latency and quality, mainly with two types of approaches. On one hand, ma+:2019 propose very simple frameworks that decode following a fixed-latency policy such as wait-k. On the other hand, there are many attempts to learn an adaptive policy that enables the model to decide between read and write actions on the fly, using various techniques such as reinforcement learning Gu+:2017; ashkan+:2018; grissom+:2014, supervised learning over pseudo-oracles zheng+:2019b; zheng+:2019, model ensembles zheng+:2020, or monotonic attention xtma+:2019; Arivazhagan+:2019.
Though existing efforts improve both translation latency and quality with more powerful frameworks, it is still difficult in practice to choose a policy that strikes the optimal balance between latency and quality, especially when the policy is trained in one domain and applied in another. Furthermore, all existing approaches are incapable of correcting mistakes made in previous steps: once an earlier step commits an error, the error propagates to later steps and induces further mistakes.
Inspired by our previous work on speculative beam search zheng2019speculative, we propose an opportunistic decoding technique with a timely correction mechanism to address the above problems. As shown in Fig. 1, our method always decodes more words than the original policy at each step to catch up with the speaker and reduce latency. At the same time, the timely correction mechanism reviews the extra outputs from previous steps with more source context. Whenever there is a disagreement, it revises these outputs according to the current preference. Our algorithm can be used in both speech-to-text and speech-to-speech simultaneous translation oda+:2014; bangalore+:2012; mahsa+:2013. In the former case, the audience will not be overwhelmed by the modifications, since we only review and modify the last few output words and the revision rate is relatively low. In the latter case, the revisable extra words can serve as the look-ahead window in incremental TTS ma2019incremental. By contrast, the alternative re-translation strategy arivazhagan2020re causes non-local revisions, which make it impossible to use in incremental TTS.
We also define, for the first time, two metrics for revision-enabled simultaneous translation: a more general latency metric, Revision-aware Average Lagging (RAL), and the revision rate. We demonstrate the effectiveness of our technique using fixed ma+:2019 and adaptive zheng+:2019b policies on both Chinese-to-English and English-to-Chinese translation.
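The conventional full-sentence NMT processes the source sentence $\mathbf{x} = (x_1, \dots, x_n)$ with an encoder, where $x_i$ represents an input token. The decoder on the target side (greedily) selects the highest-scoring word $y_t$ given the source representation and the previously generated target tokens $\mathbf{y}_{<t} = (y_1, \dots, y_{t-1})$. The final hypothesis $\mathbf{y} = (y_1, \dots, y_{|\mathbf{y}|})$, with $y_{|\mathbf{y}|} = \langle\mathrm{eos}\rangle$, has the highest probability:

$$p(\mathbf{y} \mid \mathbf{x}) = \prod_{t=1}^{|\mathbf{y}|} p(y_t \mid \mathbf{x}, \mathbf{y}_{<t})$$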
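Without loss of generality, and regardless of the actual design of the policy, simultaneous translation is represented as:

$$p_g(\mathbf{y} \mid \mathbf{x}) = \prod_{t=1}^{|\mathbf{y}|} p(y_t \mid \mathbf{x}_{\le g(t)}, \mathbf{y}_{<t})$$

where $g(t)$ is the number of source words read when the $t$-th target word is decided. It can represent any fixed or adaptive policy. For simplicity, we assume the policy is given and do not distinguish between the two types of policies.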
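The following minimal Python sketch makes the policy abstraction concrete. It instantiates $g(t)$ with the wait-k policy of ma+:2019, $g(t) = \min(k + t - 1, |\mathbf{x}|)$, and decodes greedily, so that the $t$-th target word is chosen from $\mathbf{x}_{\le g(t)}$ and $\mathbf{y}_{<t}$ only. The `predict` interface and the copy-and-uppercase `toy_predict` model are hypothetical stand-ins for a trained NMT decoder.

```python
from typing import Callable, List, Sequence

# Hypothetical model interface: (source prefix, target prefix) -> next word.
Predict = Callable[[Sequence[str], Sequence[str]], str]


def wait_k_policy(k: int, src_len: int) -> Callable[[int], int]:
    """Return g(t) = min(k + t - 1, |x|): source words read before target word t."""
    def g(t: int) -> int:
        return min(k + t - 1, src_len)
    return g


def simultaneous_greedy(src: Sequence[str], g: Callable[[int], int],
                        predict: Predict, max_len: int = 200) -> List[str]:
    """Greedy simultaneous decoding: word t conditions only on src[:g(t)] and y_<t."""
    out: List[str] = []
    for t in range(1, max_len + 1):
        word = predict(src[: g(t)], out)
        if word == "<eos>":
            break
        out.append(word)
    return out


def toy_predict(src_prefix: Sequence[str], tgt_prefix: Sequence[str]) -> str:
    """Hypothetical toy 'translator': upper-cases the aligned source word."""
    j = len(tgt_prefix)
    return src_prefix[j].upper() if j < len(src_prefix) else "<eos>"


if __name__ == "__main__":
    src = ["wo", "hen", "gaoxing"]
    g = wait_k_policy(k=2, src_len=len(src))
    print([g(t) for t in range(1, 5)])              # [2, 3, 3, 3]
    print(simultaneous_greedy(src, g, toy_predict))  # ['WO', 'HEN', 'GAOXING']
```

In this sketch, an adaptive policy could be plugged in by passing any other non-decreasing function in place of `wait_k_policy`.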
3 Opportunistic Decoding with Timely Correction and Beam Search
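For simplicity, we first apply this method to fixed policies. We denote by $y_t$ the word decoded at time step $t$ by the original model. We denote the additional words decoded at time step $t$ as $\hat{\mathbf{y}}_t^w = (\hat{y}_t^1, \dots, \hat{y}_t^w)$, where $w$ is the number of extra decoded words. In our setting, the decoding process is as follows:

$$p_g(y_t \circ \hat{\mathbf{y}}_t^w \mid \mathbf{x}_{\le g(t)}, \mathbf{y}_{<t}) = p(y_t \mid \mathbf{x}_{\le g(t)}, \mathbf{y}_{<t}) \prod_{i=1}^{w} p(\hat{y}_t^i \mid \mathbf{x}_{\le g(t)}, \mathbf{y}_{<t} \circ y_t \circ \hat{\mathbf{y}}_t^{<i})$$

where $\circ$ is the string concatenation operator and $\hat{\mathbf{y}}_t^{<i} = (\hat{y}_t^1, \dots, \hat{y}_t^{i-1})$.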
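Here is a minimal sketch of a single opportunistic step under greedy search. It assumes a hypothetical `predict(source_prefix, target_prefix)` interface. The model first decodes $y_t$ from $\mathbf{x}_{\le g(t)}$, then keeps decoding up to $w$ extra words $\hat{\mathbf{y}}_t^w$ from the same source prefix, following the factorization above. `toy_predict` and its `GUESS` table are invented for illustration: once the source runs out, the toy model guesses from the previous target word.

```python
from typing import Callable, List, Sequence, Tuple

Predict = Callable[[Sequence[str], Sequence[str]], str]

# Hypothetical fallback guesses used by the toy model once the read source runs out.
GUESS = {"I": "AM", "AM": "HAPPY", "WAS": "GLAD"}


def toy_predict(src_prefix: Sequence[str], tgt_prefix: Sequence[str]) -> str:
    """Translate the aligned source word if it was read; otherwise guess from the last word."""
    j = len(tgt_prefix)
    if j < len(src_prefix):
        return src_prefix[j].upper()
    return GUESS.get(tgt_prefix[-1], "<eos>") if tgt_prefix else "<eos>"


def opportunistic_step(src_prefix: Sequence[str], committed: Sequence[str],
                       w: int, predict: Predict) -> Tuple[str, List[str]]:
    """Return (y_t, extras): the policy's word plus up to w revisable extra words."""
    y_t = predict(src_prefix, committed)
    extras: List[str] = []
    if y_t == "<eos>":
        return y_t, extras
    context = list(committed) + [y_t]      # y_<t ∘ y_t
    for _ in range(w):                      # decode ŷ_t^1, ..., ŷ_t^w greedily
        word = predict(src_prefix, context)
        if word == "<eos>":
            break
        extras.append(word)
        context.append(word)
    return y_t, extras


if __name__ == "__main__":
    # Only "i" has been read: commit "I", then opportunistically guess two more words.
    print(opportunistic_step(["i"], [], w=2, predict=toy_predict))  # ('I', ['AM', 'HAPPY'])
```

Reading one more source word (`["i", "was"]` with `["I"]` committed) yields `("WAS", ["GLAD"])`. This disagrees with the earlier extra word `AM`, and resolving such disagreements is the job of the timely correction step.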
We refer to the procedure for generating the extra decoded sequence as opportunistic decoding, which prefers to generate more tokens based on the current context. When there is enough source information, opportunistic decoding eliminates unnecessary latency and keeps the audience on track. Occasionally, however, opportunistic decoding is too aggressive and generates inappropriate tokens, which then need to be fixed immediately.
In order to deliver the correct information to the audience promptly and fix previous mistakes as soon as possible, we also need to review and modify the previous outputs.
At step $t+1$, the encoder obtains more information, going from $\mathbf{x}_{\le g(t)}$ to $\mathbf{x}_{\le g(t+1)}$. The decoder can then generate more appropriate candidates and may revise and replace the previous outputs from opportunistic decoding. More precisely, $\hat{\mathbf{y}}_t^w$ and $y_{t+1} \circ \hat{\mathbf{y}}_{t+1}^{w-1}$ are two different hypotheses over the same time chunk. When they disagree, our model always uses the hypothesis from the later step to replace the previous output. Note that our model never changes the word $y_t$ from the previous step; it only revises the words in $\hat{\mathbf{y}}_t^w$.
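The correction rule can be sketched as a small output buffer. This is an illustrative sketch, not the paper's implementation. The hypothetical `RevisableDisplay` leaves committed words untouched, always overwrites the revisable tail with the later hypothesis, and reports how many previously shown tail words changed. `new_commit` is a list so that the same code also covers adaptive policies that commit several words at once.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class RevisableDisplay:
    """Hypothetical output buffer: irreversible prefix plus revisable opportunistic tail."""
    committed: List[str] = field(default_factory=list)  # y_1 ... y_t, never revised
    tail: List[str] = field(default_factory=list)       # ŷ_t^w, may be overwritten

    def update(self, new_commit: Sequence[str], new_extras: Sequence[str]) -> int:
        """Apply the later hypothesis; return how many shown tail words were revised."""
        later = list(new_commit) + list(new_extras)  # covers the old tail's positions
        revised = sum(
            1 for i, old in enumerate(self.tail)
            if i >= len(later) or later[i] != old    # word changed or dropped
        )
        self.committed.extend(new_commit)            # committed words are only appended
        self.tail = list(new_extras)                 # the later step always wins
        return revised

    def shown(self) -> List[str]:
        return self.committed + self.tail


if __name__ == "__main__":
    d = RevisableDisplay()
    print(d.update(["I"], ["AM", "HAPPY"]), d.shown())  # 0 ['I', 'AM', 'HAPPY']
    print(d.update(["WAS"], ["GLAD"]), d.shown())       # 2 ['I', 'WAS', 'GLAD']
```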
Modification for Adaptive Policy.
For adaptive policies, the only difference is that the model may commit multiple irreversible words at a step instead of a single word. Thus our proposed method can be easily applied to adaptive policies.
Correction with Beam Search.
When the model commits more than one word at a time, we can use beam search to further improve the translation quality and reduce the revision rate murray+chiang:2018; ma2019learning.
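The decoder maintains a beam $B_t^s$ of size $b$ at step $t$. The beam is an ordered list of pairs $\langle \text{hypothesis}, \text{probability} \rangle$, where $s$ denotes the step in beam search. At each step, there is an initial beam $B_t^0 = [\langle \mathbf{y}_{<t}, 1 \rangle]$. We denote the one-step transition from the previous beam to the next as

$$B_t^{s+1} = \mathrm{next}_1^b(B_t^s) = \mathrm{top}^b\big\{ \langle \mathbf{y}' \circ v,\ u \cdot p(v \mid \mathbf{x}_{\le g(t)}, \mathbf{y}') \rangle \ \big|\ \langle \mathbf{y}', u \rangle \in B_t^s \big\}$$

where $v$ ranges over the target vocabulary and $\mathrm{top}^b(\cdot)$ returns the $b$ top-scoring pairs. Note that, for simplicity, we do not distinguish between revisable and non-revisable output in $\mathbf{y}'$. We also define the multi-step beam search advance function recursively as follows:

$$\mathrm{next}_i^b(B_t^s) = \mathrm{next}_1^b\big(\mathrm{next}_{i-1}^b(B_t^s)\big)$$

When the opportunistic decoding window is $w$ at decoding step $t$, we define the beam search over $w+1$ words (including the original output) as follows:

$$\langle \mathbf{y}_{<t} \circ y_t \circ \hat{\mathbf{y}}_t^w,\ u_t \rangle = \mathrm{top}^1\big(\mathrm{next}_n^b(B_t^0)\big)$$

where $\mathrm{next}_n^b(\cdot)$ performs a beam search with $n$ steps and generates $y_t \circ \hat{\mathbf{y}}_t^w$ as the output, which includes both the original and the opportunistically decoded words. Here $n = |y_t \circ \hat{\mathbf{y}}_t^w|$ is the length of this output.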
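Here is a minimal sketch of the beam functions described above. It assumes a hypothetical `dist(source_prefix, hypothesis)` callable that returns next-word probabilities. `top`, `next_1`, and `next_n` mirror $\mathrm{top}^b$, $\mathrm{next}_1^b$, and $\mathrm{next}_n^b$, with the recursion unrolled into a loop. `opportunistic_beam` starts from $B_t^0 = [\langle \mathbf{y}_{<t}, 1 \rangle]$ and runs $w + 1$ steps. End-of-sentence handling is omitted for brevity. The toy table is invented so that greedy search and a beam of 2 disagree.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Pair = Tuple[Tuple[str, ...], float]  # <hypothesis, probability>
Dist = Callable[[Sequence[str], Tuple[str, ...]], Dict[str, float]]


def top(b: int, pairs: List[Pair]) -> List[Pair]:
    """top^b: the b highest-probability pairs (ties keep their original order)."""
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:b]


def next_1(beam: List[Pair], b: int, src_prefix: Sequence[str], dist: Dist) -> List[Pair]:
    """One beam step: extend each y' by every word v with score u * p(v | x_<=g(t), y')."""
    return top(b, [(hyp + (v,), u * p)
                   for hyp, u in beam
                   for v, p in dist(src_prefix, hyp).items()])


def next_n(beam: List[Pair], n: int, b: int, src_prefix: Sequence[str], dist: Dist) -> List[Pair]:
    """next_n^b: next_1^b applied n times."""
    for _ in range(n):
        beam = next_1(beam, b, src_prefix, dist)
    return beam


def opportunistic_beam(committed: Sequence[str], w: int, b: int,
                       src_prefix: Sequence[str], dist: Dist) -> Tuple[str, List[str], float]:
    """Beam search over w + 1 new words from B_t^0; return (y_t, extras, score)."""
    beam0: List[Pair] = [(tuple(committed), 1.0)]
    hyp, score = top(1, next_n(beam0, w + 1, b, src_prefix, dist))[0]
    new_words = list(hyp[len(committed):])
    return new_words[0], new_words[1:], score


# Hypothetical toy distribution that conditions only on the last target word.
TABLE = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"z": 0.9, "y": 0.1},
}


def toy_dist(src_prefix: Sequence[str], hyp: Tuple[str, ...]) -> Dict[str, float]:
    return TABLE[hyp[-1:]]


if __name__ == "__main__":
    print(opportunistic_beam([], w=1, b=1, src_prefix=["x1"], dist=toy_dist))  # greedy: a, x
    print(opportunistic_beam([], w=1, b=2, src_prefix=["x1"], dist=toy_dist))  # beam: b, z
```

In practice one would likely sum log-probabilities rather than multiply probabilities, to avoid underflow. The product form is kept here only to mirror the one-step transition as written.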
4 Revision-aware AL and Revision Rate
We define, for the first time, two metrics for revision-enabled simultaneous translation.
4.1 Revision-aware AL
AL was introduced in ma+:2019 to measure the average delay of simultaneous translation. Besides the limitations mentioned in cherry2019thinking, AL is also insensitive to modifications of committed words. Furthermore, in the case of re-translation, AL can no longer measure meaningful latency.
We hereby propose a new latency metric, Revision-aware AL (RAL). RAL can be applied to any kind of translation scenario, e.g., full-sentence translation, re-translation used as simultaneous translation, and simultaneous translation with fixed or adaptive policies. Note that for the latency and revision rate calculations, we count the target-side differences as the source side grows. As shown in Fig. 3, an output word may change several times during translation, and we only start calculating the latency for a word once it agrees with the final result. Therefore, it is necessary to locate the last change for each word. For a given source-side time $s$, we denote the target-side output as $f(s)$, with $f(0)$ empty and $f(s)_t$ treated as padding when $f(s)$ has fewer than $t$ words. We can then find the Last Revision (LR) of the $t$-th target word as follows:

$$\mathrm{LR}(t) = \max\{\, 1 \le s \le |\mathbf{x}| : f(s-1)_t \neq f(s)_t \,\}$$

From the audience's point of view, once earlier words are changed, the audience also needs to make the effort to reread the following words. Hence we also penalize later words even if they have not changed, as shown by the blue arrow in Fig. 3. We then reformulate $\mathrm{LR}(t)$ as follows (assuming $\mathrm{LR}(0) = 0$):

$$\mathrm{LR}(t) = \max\Big\{ \mathrm{LR}(t-1),\ \max\{\, 1 \le s \le |\mathbf{x}| : f(s-1)_t \neq f(s)_t \,\} \Big\}$$
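The LR recursion can be computed directly from the sequence of displayed outputs. This sketch assumes `outputs[s]` holds $f(s)$ as a list of words for $s = 0, \dots, |\mathbf{x}|$, with $f(0)$ empty. It treats a missing $t$-th word as padding (`None`), so the first appearance of a word counts as a change. The function names are hypothetical.

```python
from typing import List, Optional, Sequence


def word_at(seq: Sequence[str], t: int) -> Optional[str]:
    """1-indexed t-th word of seq, or None (padding) if seq is shorter."""
    return seq[t - 1] if t <= len(seq) else None


def last_revision(outputs: List[Sequence[str]]) -> List[int]:
    """Return [LR(1), ..., LR(|y|)], where outputs[s] = f(s) for s = 0..|x|."""
    src_len = len(outputs) - 1
    final = outputs[-1]
    lrs: List[int] = []
    prev = 0                                   # LR(0) = 0
    for t in range(1, len(final) + 1):
        # Last source step at which word t changed; never empty because f(0) is empty.
        last_change = max(
            s for s in range(1, src_len + 1)
            if word_at(outputs[s - 1], t) != word_at(outputs[s], t)
        )
        prev = max(prev, last_change)          # later words inherit earlier revisions
        lrs.append(prev)
    return lrs


if __name__ == "__main__":
    outputs = [[], ["WE"], ["WE", "ARE"], ["I", "ARE", "GLAD"], ["I", "ARE", "GLAD", "."]]
    print(last_revision(outputs))  # [3, 3, 3, 4]
```

In the example, the first word changes from `WE` to `I` at $s = 3$. The second word therefore inherits $\mathrm{LR}(2) = 3$ even though it has been stable since $s = 2$; this is the penalty on later words described above.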
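The above definition can be visualized as the thick black line in Fig. 3. Similar to the original AL, our proposed RAL is defined as follows:

$$\mathrm{RAL}(\mathbf{x}, \mathbf{y}) = \frac{1}{\tau(|\mathbf{x}|)} \sum_{t=1}^{\tau(|\mathbf{x}|)} \left[ \mathrm{LR}(t) - \frac{t-1}{r} \right]$$

where $\tau(|\mathbf{x}|) = \min\{\, t : \mathrm{LR}(t) = |\mathbf{x}| \,\}$ denotes the cut-off step, and $r = |\mathbf{y}| / |\mathbf{x}|$ is the target-to-source length ratio.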
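Given LR values, RAL follows the AL formula with $g(t)$ replaced by $\mathrm{LR}(t)$. This is a minimal sketch. It assumes `lrs` holds $\mathrm{LR}(1), \dots, \mathrm{LR}(|\mathbf{y}|)$ for the final output, and that the cut-off is the first $t$ with $\mathrm{LR}(t) = |\mathbf{x}|$. If that never happens, the sketch falls back to $|\mathbf{y}|$, which is its own choice.

```python
from typing import Sequence


def ral(lrs: Sequence[int], src_len: int) -> float:
    """Revision-aware AL from LR(1..|y|); r = |y| / |x|, cut off at the first LR(t) = |x|."""
    tgt_len = len(lrs)
    r = tgt_len / src_len
    # Cut-off step tau; the fallback to |y| is an assumption of this sketch.
    tau = next((t for t, lr in enumerate(lrs, start=1) if lr == src_len), tgt_len)
    return sum(lrs[t - 1] - (t - 1) / r for t in range(1, tau + 1)) / tau


if __name__ == "__main__":
    # No revisions: LR(t) = g(t) for wait-2 on a 4-word source, so RAL equals AL.
    print(ral([2, 3, 4, 4], src_len=4))  # 2.0
    # LR values from an opportunistic run with one revised word.
    print(ral([2, 3, 3, 4], src_len=4))  # 1.5
```

When no word is ever revised, $\mathrm{LR}(t) = g(t)$ and RAL reduces to the original AL. For example, a wait-2 policy on a 4-word source with a 4-word output gives $\mathrm{LR} = (2, 3, 4, 4)$ and a value of 2.0.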
4.2 Revision Rate
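Since each modification on the target side costs the audience extra reading effort, we penalize all revisions during translation. We define the revision rate as follows:

$$\frac{\sum_{s=1}^{|\mathbf{x}|-1} \mathrm{dist}\big(f(s), f(s+1)\big)}{\sum_{s=1}^{|\mathbf{x}|} |f(s)|}$$

where dist can be any distance measure between two sequences. For simplicity, we design a modified Hamming distance to measure the difference:

$$\mathrm{dist}(\mathbf{a}, \mathbf{b}) = \mathrm{hamming}\big(\mathbf{a},\ \mathbf{b}_{\le |\mathbf{a}|} \circ \langle\mathrm{pad}\rangle^{\max(|\mathbf{a}|-|\mathbf{b}|,\,0)}\big)$$

where $\langle\mathrm{pad}\rangle$ is a padding symbol used in case $\mathbf{b}$ is shorter than $\mathbf{a}$.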
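Here is a minimal sketch of the revision rate with the modified Hamming distance. It assumes `outputs[s]` holds $f(s)$ for $s = 0, \dots, |\mathbf{x}|$ ($f(0)$ is unused). It also assumes the normalizer is the total number of displayed words, $\sum_{s=1}^{|\mathbf{x}|} |f(s)|$. Only changes to previously shown words count as revisions. Words appended at the end of $f(s+1)$ are not revisions, while a word that disappears is compared against the padding symbol. `PAD` and the function names are hypothetical.

```python
from typing import List, Sequence

PAD = "<pad>"  # hypothetical padding symbol


def modified_hamming(a: Sequence[str], b: Sequence[str]) -> int:
    """Count positions of a whose word differs in b (b cut to |a|, padded if shorter)."""
    b_aligned = list(b[: len(a)]) + [PAD] * max(len(a) - len(b), 0)
    return sum(1 for x, y in zip(a, b_aligned) if x != y)


def revision_rate(outputs: List[Sequence[str]]) -> float:
    """Revised words between consecutive outputs divided by all displayed words."""
    src_len = len(outputs) - 1
    revisions = sum(modified_hamming(outputs[s], outputs[s + 1]) for s in range(1, src_len))
    shown = sum(len(outputs[s]) for s in range(1, src_len + 1))
    return revisions / shown if shown else 0.0


if __name__ == "__main__":
    outputs = [[], [], ["I", "AM"], ["I", "WAS", "GLAD"], ["I", "WAS", "GLAD", "."]]
    print(revision_rate(outputs))  # one revision out of 9 displayed words
```

In the example, one word (`AM`, later `WAS`) is revised out of 9 displayed words, giving a rate of 1/9.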
Datasets and Implementation
We evaluate our work on Chinese-to-English and English-to-Chinese simultaneous translation tasks. We use the NIST corpus (2M sentence pairs) as the training data. We first apply BPE sennrich+:2015 to all texts to reduce the vocabulary sizes. For evaluation, we use NIST 2006 and NIST 2008 as our dev and test sets, with 4 English references. We re-implement the wait-k model ma+:2019 and the adaptive policy zheng+:2019b. We use Transformer-based vaswani+:2017 wait-k models, and a pre-trained full-sentence model for learning the adaptive policy.
Performance on Wait-k Policy
We perform experiments using opportunistic decoding on wait-k policies, varying k, the opportunistic window w, and the beam size b. We select the best beam size for each policy and window pair on the dev set.
We compare our method with a baseline called re-translation, which uses a full-sentence NMT model to re-decode the whole target sentence whenever a new source word is observed. The final outputs of this method are identical to the full-sentence translation outputs of the same model, but the latency is reduced.
Fig. 4 (left) shows the Chinese-to-English results of our algorithm. Since greedy opportunistic decoding does not change the final output, its BLEU is identical to that of normal decoding, but its latency is lower. Moreover, applying beam search yields a 3.1 BLEU improvement and a 2.4 latency reduction on the wait-7 policy.
Fig. 4 (right) shows the English-to-Chinese results. Compared to the Chinese-to-English results above, beam search yields comparatively less latency reduction, because the output translations are slightly longer, which hurts latency. As shown in Fig. 5 (right), the revision rate is still kept under 8%.
Fig. 5 shows the revision rate with different window sizes on wait-k policies. In general, with opportunistic window , the revision rate of our approach is under , which is much lower than that of re-translation.
Performance on Adaptive Policy
Fig. 6 shows the performance of the proposed algorithm on adaptive policies. We use threshold . We vary the beam size and select the best one on the dev set. Compared with conventional beam search over consecutive writes, our decoding algorithm achieves much higher BLEU and lower latency.
5.1 Revision Rate vs. Window Size
We further investigate the revision rate with different beam sizes on wait-k policies. Fig. 7 shows that the revision rate is higher for wait-k policies with smaller k. This makes sense, because policies with small k are more aggressive and more prone to mistakes. Moreover, the revision rate is not very sensitive to the beam size.
We have proposed an opportunistic decoding technique with timely correction that improves both the latency and the quality of simultaneous translation. We also defined, for the first time, two metrics for revision-enabled simultaneous translation.
L. H. was supported in part by NSF IIS-1817231.