Opportunistic Decoding with Timely Correction for Simultaneous Translation

05/02/2020 · by Renjie Zheng, et al. · Baidu, Inc.

Simultaneous translation has many important application scenarios and has recently attracted much attention from both academia and industry. Most existing frameworks, however, have difficulties balancing translation quality and latency; i.e., the decoding policy is usually either too aggressive or too conservative. We propose an opportunistic decoding technique with timely correction ability, which always (over-)generates a certain amount of extra words at each step to keep the audience on track with the latest information. At the same time, it also corrects, in a timely fashion, the mistakes in the formerly overgenerated words when observing more source context, to ensure high translation quality. Experiments show our technique achieves substantial reduction in latency and up to +3.1 increase in BLEU, with revision rate under 8%.


1 Introduction

Simultaneous translation, which starts translation before the speaker finishes, is extremely useful in many scenarios, such as international conferences and travel. In order to achieve low latency, it is often inevitable to generate target words with insufficient source information, which makes this task extremely challenging.

Recently, there have been many efforts towards balancing translation latency and quality, mainly with two types of approaches. On one hand, ma+:2019 propose very simple frameworks that decode following a fixed-latency policy such as wait-k. On the other hand, there are many attempts to learn an adaptive policy which enables the model to decide the READ or WRITE action on the fly, using various techniques such as reinforcement learning Gu+:2017; ashkan+:2018; grissom+:2014, supervised learning over pseudo-oracles zheng+:2019b, imitation learning zheng+:2019, model ensemble zheng+:2020, or monotonic attention xtma+:2019; Arivazhagan+:2019.

Figure 1: Besides the normally committed words, opportunistic decoding continues to generate additional words, which form the revisable opportunistic window. The timely correction only revises this part in future steps. Different shapes denote different words. In this example, from step t to step t+1, all previously opportunistically decoded words are revised, and an extra triangle word is generated in the opportunistic window. From step t+1 to step t+2, two words from the previous opportunistic window are kept and only the triangle word is revised.

Though these existing efforts improve both translation latency and quality with more powerful frameworks, it is still difficult in practice to choose an appropriate policy that strikes the optimal balance between latency and quality, especially when the policy is trained and applied in different domains. Furthermore, all existing approaches are incapable of correcting mistakes from previous steps. When earlier steps commit errors, those errors propagate to later steps and induce further mistakes.

Inspired by our previous work on speculative beam search zheng2019speculative, we propose an opportunistic decoding technique with a timely correction mechanism to address the above problems. As shown in Fig. 1, our proposed method always decodes more words than the original policy at each step, to catch up with the speaker and reduce latency. At the same time, it also employs a timely correction mechanism to review the extra outputs from previous steps with more source context, and revises these outputs with the current preference when there is a disagreement. Our algorithm can be used in both speech-to-text and speech-to-speech simultaneous translation oda+:2014; bangalore+:2012; mahsa+:2013. In the former case, the audience will not be overwhelmed by the modifications, since we only review and modify the last few output words with a relatively low revision rate. In the latter case, the revisable extra words can be used as the look-ahead window in incremental TTS ma2019incremental. By contrast, the alternative re-translation strategy arivazhagan2020re causes non-local revisions, which makes it unsuitable for incremental TTS.

We also define, for the first time, two metrics for revision-enabled simultaneous translation: a more general latency metric, Revision-aware Average Lagging (RAL), as well as the revision rate. We demonstrate the effectiveness of our proposed technique using fixed ma+:2019 and adaptive zheng+:2019b policies in both Chinese-to-English and English-to-Chinese translation.

2 Preliminaries

Full-sentence NMT.

The conventional full-sentence NMT processes the source sentence $\mathbf{x} = (x_1, \dots, x_n)$ with an encoder, where $x_i$ represents an input token. The decoder on the target side (greedily) selects the highest-scoring word $y_t$ given the source representation and previously generated target tokens $\mathbf{y}_{<t} = (y_1, \dots, y_{t-1})$, and the final hypothesis $\mathbf{y}^\star$ has the highest probability:

$$\mathbf{y}^\star = \operatorname*{argmax}_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x}) \tag{1}$$

Simultaneous Translation.

Without loss of generality, regardless of the actual design of the policy, simultaneous translation can be represented as:

$$\mathbf{y}^\star = \operatorname*{argmax}_{\mathbf{y}} \prod_{t} p\big(y_t \mid \mathbf{x}_{\le g(t)},\, \mathbf{y}_{<t}\big) \tag{2}$$

where $g(t)$ denotes the number of observed source tokens when deciding the $t$-th target word, and can be used to represent any arbitrary fixed or adaptive policy. For simplicity, we assume the policy is given and do not distinguish between the two types of policies.
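As a concrete example, the fixed wait-k policy of ma+:2019 corresponds to the following $g(t)$; this is a minimal sketch of ours, not code from the paper:

```python
def wait_k_policy(k, src_len):
    """g(t) for a wait-k policy: the decoder first waits for k source
    words, then alternates one READ and one WRITE until the source
    sentence (of length src_len) is exhausted."""
    def g(t):  # t is the 1-indexed target position
        return min(k + t - 1, src_len)
    return g

# Example: wait-3 on a 6-word source sentence.
g = wait_k_policy(3, 6)
print([g(t) for t in range(1, 7)])  # -> [3, 4, 5, 6, 6, 6]
```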

3 Opportunistic Decoding with Timely Correction and Beam Search

Figure 2: The decoder generates the target word "his" and two extra words "welcome to" at a step when the input "zàntóng" ("agreement") is not available yet. When the model receives this word at the next step, the decoder immediately corrects the previously made mistake, replacing "welcome" with "agreement", and emits two additional target words ("to President"). The decoder is not only capable of fixing the previous mistake, but also has enough information to generate more correct words. Our framework benefits from opportunistic decoding with reduced latency here. Note that although the word "to" is generated at the earlier step, it only becomes irreversible at the later step.

Opportunistic Decoding.

For simplicity, we first apply this method to fixed policies. We define the original decoded word sequence at time step $t$ as $\mathbf{y}_t$, which represents the words that are decoded at time step $t$ with the original model. We denote the additional decoded words at time step $t$ as $\mathbf{w}_t = (w_t^1, \dots, w_t^c)$, where $c$ denotes the number of extra decoded words. In our setting, the output at each step is as follows:

$$\mathbf{o}_t = \mathbf{y}_t \circ \mathbf{w}_t \tag{3}$$

where $\circ$ is the string concatenation operator and $\mathbf{o}_t$ is the output shown to the audience at step $t$.

We treat the procedure for generating the extra decoded sequence as opportunistic decoding, which prefers to generate more tokens based on the current context. When we have enough information, this opportunistic decoding eliminates unnecessary latency and keeps the audience on track. With a certain chance, when opportunistic decoding is too aggressive and generates inappropriate tokens, we need to fix the inaccurate tokens immediately.
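As an illustration, here is a minimal Python sketch of the over-generation step (ours, not the paper's implementation); `decode_next` is an assumed helper that returns the most probable next word given the visible source prefix and the target context:

```python
def opportunistic_decode(decode_next, src_prefix, committed, window):
    """Over-generate up to `window` extra revisable words beyond what
    the underlying policy commits at this step."""
    extra, context = [], list(committed)
    for _ in range(window):
        word = decode_next(src_prefix, context)  # greedy next-word choice
        if word == "</s>":  # stop early at end-of-sentence
            break
        extra.append(word)
        context.append(word)
    return extra  # shown to the audience, but revisable at later steps
```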

Timely Correction.

In order to deliver the correct information to the audience promptly and fix previous mistakes as soon as possible, we also need to review and modify the previous outputs.

At step $t+1$, when the encoder obtains more information from $\mathbf{x}_{\le g(t)}$ to $\mathbf{x}_{\le g(t+1)}$, the decoder is capable of generating more appropriate candidates and may revise and replace the previous outputs from opportunistic decoding. More precisely, $\mathbf{w}_t$ and $\mathbf{y}_{t+1} \circ \mathbf{w}_{t+1}$ are two different hypotheses over the same time chunk. When there is a disagreement, our model always uses the hypothesis from the later step to replace the previous commits. Note that our model does not change any word in $\mathbf{y}_t$ from previous steps; it only revises the words in $\mathbf{w}_t$.
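A minimal sketch of this replace-on-disagreement rule (ours, under the definitions above): the audience keeps the agreeing prefix of last step's revisable words, and everything after the first disagreement is retracted and replaced by the newer hypothesis:

```python
def timely_correction(prev_extra, new_hyp):
    """Return how many of last step's revisable words survive: the
    agreeing prefix is kept; everything after the first disagreement
    is retracted and replaced by the newer hypothesis `new_hyp`."""
    kept = 0
    for old, new in zip(prev_extra, new_hyp):
        if old != new:
            break
        kept += 1
    return kept  # prev_extra[kept:] gets replaced by new_hyp[kept:]
```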

Modification for Adaptive Policy.

For adaptive policies, the only difference is that, instead of committing a single word at each step, the model is capable of generating multiple irreversible words. Thus, our proposed method can be easily applied to adaptive policies.

Correction with Beam Search.

When the model commits more than one word at a time, we can use beam search to further improve translation quality and reduce the revision rate murray+chiang:2018; ma2019learning.

The decoder maintains a beam $B^i$ of size $b$ at step $i$, which is an ordered list of $\langle \text{hypothesis}, \text{probability} \rangle$ pairs, where $i$ denotes the step in beam search. At each step, there is an initial beam $B^0$. We denote the one-step transition from the previous beam to the next as

$$B^{i} = \operatorname{top}^{b}\big(\operatorname{next}(B^{i-1})\big)$$

where $\operatorname{top}^{b}(\cdot)$ returns the $b$ top-scoring pairs and $\operatorname{next}(\cdot)$ expands each hypothesis in the beam by one word. Note that we do not distinguish the revisable and non-revisable outputs in the beam for simplicity. We also define the multi-step advance beam search function in a recursive fashion as follows:

$$\operatorname{beamsearch}^{j}(B) = \begin{cases} B & j = 0 \\ \operatorname{beamsearch}^{j-1}\big(\operatorname{top}^{b}(\operatorname{next}(B))\big) & j > 0 \end{cases}$$

When the opportunistic decoding window is $w$ at decoding step $t$, we define the beam search over $|\mathbf{y}_t| + w$ steps (including the original output) as follows:

$$\mathbf{o}_t = \operatorname{best}\big(\operatorname{beamsearch}^{|\mathbf{y}_t| + w}(B^0)\big) \tag{4}$$

where $\operatorname{beamsearch}^{|\mathbf{y}_t| + w}$ performs a beam search with $|\mathbf{y}_t| + w$ steps and generates $\mathbf{o}_t$ as the output, which includes both the original and the opportunistically decoded words, and $|\mathbf{y}_t|$ represents the length of $\mathbf{y}_t$.
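The recursion above can be sketched in a few lines of Python (ours, not the paper's implementation); `model` is an assumed callable that returns a log-probability for each candidate word given the current source prefix and target hypothesis:

```python
import heapq

def beamsearch(model, src_prefix, beam, steps, size):
    """Recursive multi-step beam search (Eq. 4). `beam` is an ordered
    list of (log_prob, hypothesis) pairs; `model(src_prefix, hyp)`
    returns a dict mapping each candidate word to its log-probability."""
    if steps == 0:
        return beam
    candidates = []
    for log_p, hyp in beam:
        for word, lp in model(src_prefix, hyp).items():  # next(.)
            candidates.append((log_p + lp, hyp + [word]))
    next_beam = heapq.nlargest(size, candidates, key=lambda c: c[0])  # top^b
    return beamsearch(model, src_prefix, next_beam, steps - 1, size)
```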

4 Revision-aware AL and Revision Rate

We define, for the first time, two metrics for revision-enabled simultaneous translation.

4.1 Revision-aware AL

AL was introduced in ma+:2019 to measure the average delay of simultaneous translation. Besides the limitations mentioned in cherry2019thinking, AL is also not sensitive to modifications of the committed words. Furthermore, in the case of re-translation, AL is incapable of measuring meaningful latency at all.

Figure 3: The red arrows represent the changes between two different commits, and the last change for each output word is highlighted in yellow.

We hereby propose a new latency metric, Revision-aware AL (RAL), which can be applied to any kind of translation scenario: full-sentence translation, re-translation used as simultaneous translation, and fixed- or adaptive-policy simultaneous translation. Note that for the latency and revision rate calculation, we count the target-side differences with respect to the growth of the source side. As shown in Fig. 3, there might be multiple changes to each output word during the translation, and we only start to calculate the latency for a word once it agrees with the final result. Therefore, it is necessary to locate the last change for each word. For a given source-side time $s$, we denote the outputs on the target side as $\mathbf{y}_s$. Then we are able to find the Last Revision (LR) for the $i$-th word on the target side as follows:

$$\operatorname{LR}(i) = \max\, \big\{\, s \;\big|\; \mathbf{y}_s[i] \neq \mathbf{y}_{s-1}[i] \,\big\}$$
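As a concrete reading of this definition, the following Python sketch (ours, under the reconstructed notation above) computes LR for every word of the final output, given the sequence of target-side outputs:

```python
def last_revision(outputs):
    """LR(i) for each word of the final output. `outputs[s]` holds the
    target words shown after reading s+1 source words, so a change
    between outputs[s-1] and outputs[s] happens at source time s+1."""
    final = outputs[-1]
    lr = []
    for i in range(len(final)):
        last = 1  # words present from the very first output count from time 1
        for s in range(1, len(outputs)):
            prev = outputs[s - 1][i] if i < len(outputs[s - 1]) else None
            curr = outputs[s][i] if i < len(outputs[s]) else None
            if curr != prev:
                last = s + 1
        lr.append(last)
    return lr
```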

From the audience's point of view, once the former words are changed, the audience also needs to take the effort to re-read the following words as well. We therefore also penalize the later words even when there are no changes, which is shown with the blue arrows in Fig. 3. We then re-formulate the last revision as follows (assume $\operatorname{LR}(0) = 0$):

$$\operatorname{LR}^{\star}(i) = \max_{0 \le j \le i} \operatorname{LR}(j) \tag{5}$$

Figure 4: BLEU against RAL using wait-k policies with different k, compared against re-translation and full-sentence translation with the pre-trained NMT model under greedy and beam-search decoding. The baseline for wait-k policies is decoding without the opportunistic window.

Figure 5: Revision rate against window size with different wait-k policies, compared against re-translation with the pre-trained NMT model under greedy and beam-search decoding.

The above definition can be visualized as the thick black line in Fig. 3. Similar to the original AL, our proposed RAL is defined as follows:

$$\operatorname{RAL}(\mathbf{x}, \mathbf{y}) = \frac{1}{\tau} \sum_{i=1}^{\tau} \operatorname{LR}^{\star}(i) - \frac{i-1}{r} \tag{6}$$

where $\tau$ denotes the cut-off step, and $r = |\mathbf{y}| / |\mathbf{x}|$ is the target-to-source length ratio.
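To make the metric concrete, here is a small Python sketch (ours, under the reconstructed definitions above) that computes LR*, the cut-off step, and RAL; it reuses `last_revision` from the previous snippet:

```python
def ral(outputs, src_len):
    """Revision-aware Average Lagging (Eq. 6); reuses last_revision
    from the previous snippet."""
    final = outputs[-1]
    # Eq. (5): a revision forces re-reading of all following words.
    lr_star, running = [], 0
    for v in last_revision(outputs):
        running = max(running, v)
        lr_star.append(running)
    r = len(final) / src_len  # target-to-source length ratio
    # cut-off step: the first target word finalized only with the full source
    tau = next((i + 1 for i, v in enumerate(lr_star) if v >= src_len),
               len(final))
    return sum(lr_star[i] - i / r for i in range(tau)) / tau
```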

4.2 Revision Rate

Since each modification on the target side costs the audience extra effort to read, we penalize all revisions during the translation. We define the revision rate as follows:

$$\text{revision rate} = \frac{\sum_{s=1}^{|\mathbf{x}|-1} \operatorname{dist}(\mathbf{y}_s, \mathbf{y}_{s+1})}{\sum_{s=1}^{|\mathbf{x}|-1} |\mathbf{y}_{s+1}|}$$

where $\operatorname{dist}$ can be an arbitrary distance measurement between two sequences. For simplicity, we design a modified Hamming Distance to measure the difference:

$$\operatorname{dist}(\mathbf{a}, \mathbf{b}) = \sum_{i=1}^{|\mathbf{a}|} \mathbb{1}\big[\mathbf{a}[i] \neq \tilde{\mathbf{b}}[i]\big]$$

where $\tilde{\mathbf{b}}$ denotes $\mathbf{b}$ padded with the padding symbol $\varnothing$ in case $\mathbf{b}$ is shorter than $\mathbf{a}$.
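A small Python sketch of the revision rate under the reconstructed definition above (ours; `PAD` stands in for the padding symbol $\varnothing$):

```python
PAD = "<pad>"  # stands in for the padding symbol

def modified_hamming(prev, curr):
    """Count previously shown words that the newer output changed or
    retracted; words newly appended by `curr` are not revisions."""
    padded = curr + [PAD] * max(0, len(prev) - len(curr))
    return sum(p != c for p, c in zip(prev, padded))

def revision_rate(outputs):
    """outputs[s] = target words shown after reading s+1 source words."""
    changed = sum(modified_hamming(outputs[s], outputs[s + 1])
                  for s in range(len(outputs) - 1))
    total = sum(len(o) for o in outputs[1:])
    return changed / total if total else 0.0
```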

  

Figure 6: BLEU against RAL using adaptive policies. The baseline is decoded without the opportunistic window.

5 Experiments

Datasets and Implementation

We evaluate our work on Chinese-to-English and English-to-Chinese simultaneous translation tasks. We use the NIST corpus (2M sentence pairs) as the training data. We first apply BPE sennrich+:2015 on all texts to reduce the vocabulary sizes. For evaluation, we use NIST 2006 and NIST 2008 as our dev and test sets, with 4 English references. We re-implement the wait-k model ma+:2019 and the adaptive policy zheng+:2019b. Both the wait-k model and the pre-trained full-sentence model used for learning the adaptive policy are based on Transformer vaswani+:2017.

Performance on Wait-k Policy

We perform experiments using opportunistic decoding on wait-k policies with various values of k, opportunistic window sizes, and beam sizes. We select the best beam size for each policy and window pair on the dev set.

We compare our proposed method with a baseline called re-translation, which uses a full-sentence NMT model to re-decode the whole target sentence once a new source word is observed. The final output sentences of this method are identical to the full-sentence translation outputs of the same model, but the latency is reduced.

Fig. 4 (left) shows the Chinese-to-English results of our proposed algorithm. Since our greedy opportunistic decoding does not change the final output, there is no difference in BLEU compared with normal decoding, but the latency is reduced. By applying beam search, however, we achieve a 3.1 BLEU improvement and a 2.4 latency reduction on the wait-7 policy.

Fig. 4 (right) shows the English-to-Chinese results. Compared to the Chinese-to-English results above, there is less latency reduction from beam search because the output translations are slightly longer, which hurts latency. As shown in Fig. 5 (right), the revision rate is still kept under 8%.

Fig. 5 shows the revision rate with different window sizes on wait-k policies. In general, with the opportunistic window, the revision rate of our proposed approach stays under 8%, which is much lower than that of re-translation.

Performance on Adaptive Policy

Fig. 6 shows the performance of the proposed algorithm on adaptive policies, using a fixed probability threshold. We vary the beam size and select the best one on the dev set. Compared with conventional beam search over consecutive writes, our decoding algorithm achieves much higher BLEU with even lower latency.

5.1 Revision Rate vs. Beam Size


Figure 7: Revision rate against beam size with a window size of 3 and different wait-k policies.

We further investigate the revision rate with different beam sizes on wait-k policies. Fig. 7 shows that the revision rate is higher for wait-k policies with smaller k. This makes sense because lower-k policies are more aggressive and thus more prone to mistakes. Moreover, we find that the revision rate is not very sensitive to the beam size.

6 Conclusions

We have proposed an opportunistic decoding technique with timely correction, which improves both the latency and the quality of simultaneous translation. We also defined, for the first time, two metrics for revision-enabled simultaneous translation.

Acknowledgments

L. H. was supported in part by NSF IIS-1817231.

References