A common approach to Spoken Language Translation (SLT) is to use a cascade (or pipeline) consisting of automatic speech recognition (ASR) and machine translation (MT). In a live translation setting, such as a lecture or conference, we would like the transcriptions or translations to appear as quickly as possible, so that they do not “lag” noticeably behind the speaker. In other words, we wish to minimise the latency of the system. Many popular ASR toolkits can operate in an online mode, where the transcription is produced incrementally, instead of waiting for the speaker to finish their utterance. However online MT is less well supported, and is complicated by the reordering which is often necessary in translation, and by the use of encoder-decoder models which assume sight of the whole source sentence.
Some systems for online SLT rely on the streaming approach to translation, perhaps inspired by human interpreters. In this approach, the MT system is modified to translate incrementally, and on each update from ASR it will decide whether to update its translation, or wait for further ASR output (Cho and Esipova, 2016; Ma et al., 2019; Zheng et al., 2019, 2019; Arivazhagan et al., 2019).
The difficulty with the streaming approach is that the system has to choose between committing to a particular choice of translation output, or waiting for further updates from ASR, and does not have the option to revise an incorrect choice. Furthermore, all the streaming approaches referenced above require specialised training of the MT system, and modified inference algorithms.
To address the issues above, we construct our incremental MT system with the retranslation approach (Niehues et al., 2018; Arivazhagan et al., 2020), which is less studied but more straightforward. It can be implemented using any standard MT toolkit (such as Marian (Junczys-Dowmunt et al., 2018), which is highly optimised for speed) and can use the latest advances in text-to-text translation. The idea of retranslation is that we produce a new translation of the current sentence every time a partial sentence is received from the ASR system. Thus, each prefix translation is independent and can be directly handled by a standard MT system. Arivazhagan et al. (2020) directly compare the retranslation and streaming approaches and find that the former has a better latency–quality trade-off curve.
When using a completely unadapted MT system with the retranslation approach, however, there are at least two problems we need to consider:
1. MT training data generally consists of full sentences, and systems may perform poorly on partial sentences.
2. When MT systems are asked to translate progressively longer segments of the conversation, they may introduce radical changes in the translation as the prefixes are extended. If these updates are displayed to the user, they will introduce an annoying “flicker” in the output, making it hard to read.
We illustrate these points using the small example in Figure 1. When the MT system receives the first prefix (“Several”) it attempts to make a longer translation, due to its bias towards producing sentences. When the prefix is extended (to “Several years ago”), the MT system completely revises its original translation hypothesis. This is caused by the differing word order between German and English.
Figure 1: Translations of successive source prefixes (for example, the prefix “Several years ago” is translated as “Vor einigen Jahre”).
The first problem above could be addressed by simply adding sentence prefixes to the training data of the MT system. In our experiments we found that using prefixes in training could improve translation of partial sentences, but required careful mixing of data, and even then performance of the model trained on truncated sentences was often worse on full sentences.
A way to address both problems above is with an appropriate retranslation strategy. In other words, when the MT system receives a new prefix, it should decide whether to transmit its translation in full, partially, or wait for further input, and the system can take into account translations it previously produced. A good retranslation strategy will address the second problem above (too much flickering as translations are revised) and in so doing address the first (over-eagerness to produce full sentences).
In this paper, we focus on the retranslation methods introduced by Arivazhagan et al. (2020): mask-$k$ and biased beam search. The former is a delayed output strategy which does not affect overall quality, but can significantly increase the latency of the translation system. The latter alters the beam search to take into account the translation of the previous prefix, and is used to reduce flicker without influencing latency much, but can also damage translation quality.
Our contribution in this paper is to show that by using a straightforward method to predict the value of $k$ in the mask-$k$ strategy, we obtain a better trade-off between flicker and latency than is possible with a fixed mask. We achieve this by making probes of possible source extensions, and observing how stable the translation of these probes is: instability in the translation requires a larger mask. Our method requires no modifications to the underlying MT system, and has no effect on translation quality.
2 Related Works
Early work on incremental MT used prosody (Bangalore et al., 2012) or lexical cues (Rangarajan Sridhar et al., 2013) to make the translate-or-wait decision. The first work on incremental neural MT used confidence to decide whether to wait or translate (Cho and Esipova, 2016), whilst Gu et al. (2017) learn the translation schedule with reinforcement learning. Ma et al. (2019) address simultaneous translation using a transformer (Vaswani et al., 2017) model with a modified attention mechanism, which is trained on prefixes. They introduce the idea of wait-$k$, where the translation does not consider the final $k$ words of the input. This work was extended by Zheng et al. (2019, 2019), where a “delay” token is added to the target vocabulary so the model can learn when to wait, through being trained by imitation learning. The MILk attention (Arivazhagan et al., 2019) also provides a way of learning the translation schedule along with the MT model, and is able to directly optimise the latency metric.
In contrast with these recent approaches, retranslation strategies (Niehues et al., 2018) allow the use of a standard MT toolkit, with little modification, and so are able to leverage all the performance and quality optimisations in that toolkit. Arivazhagan et al. (2020) pioneered the retranslation system by combining a strong MT system with two simple yet effective strategies: biased beam search and mask-$k$. Their experiments show that the system can achieve low flicker and latency without losing much performance. In an even more recent paper, Arivazhagan et al. (2020) further combine their retranslation system with prefix training and compare it with the current best streaming models (e.g. MILk and wait-$k$ models), showing that such a retranslation system is a strong option for online SLT.
3 Retranslation Strategies
Before introducing our approach, we describe the two retranslation strategies introduced in Arivazhagan et al. (2020): mask-$k$ and biased beam search.
The idea of mask-$k$ is simply that the MT system does not transmit the last $k$ tokens of its output; in other words, it masks them. Once the system receives a full sentence, it transmits the translation in full, without masking. The value of $k$ is set globally and can be tuned to reduce the amount of flicker, at the cost of increasing latency. Arivazhagan et al. (2020) showed good results for a mask of 10, but of course for short sentences a system with such a large mask would not produce any output until the end of the sentence.
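The masking rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `mask_k` and the `sentence_complete` flag are hypothetical, and the hypothesis is assumed to be a list of already tokenised target words.

```python
from typing import List


def mask_k(hypothesis: List[str], k: int, sentence_complete: bool) -> List[str]:
    """Return the part of a translation hypothesis that is shown to the user.

    Hypothetical helper: hides the last k tokens of a partial translation,
    and shows everything once the source sentence is complete.
    """
    if sentence_complete or k <= 0:
        return list(hypothesis)
    # Drop the last k tokens; hypotheses of k tokens or fewer yield no output.
    return hypothesis[:max(0, len(hypothesis) - k)]
```

With a large global mask, a short hypothesis produces no visible output until the sentence ends, which is the latency cost mentioned in the text.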
In biased beam search, a small modification is made to the translation algorithm, changing the search objective. The technique aims to reduce flicker by ensuring that the translation produced by the MT system stays closer to the translation of the previous (shorter) prefix. Suppose that $S$ is a source prefix, $S'$ is the extension of that source prefix provided by the ASR, and $T$ is the translation of $S$ produced by the system (after masking). Then to create the translation $T'$ of $S'$, biased beam search substitutes the model probability $p(t'_i \mid t'_{<i}, S')$ with the following expression:

$$(1-\beta)\, p(t'_i \mid t'_{<i}, S') + \beta\, \delta(t'_i, t_i)$$

where $t'_i$ is the $i$th token of the translation hypothesis $T'$, $t_i$ is the $i$th token of $T$, $\delta(t'_i, t_i)$ is 1 if the two tokens are identical and 0 otherwise, and $\beta$ is a weighting which we set to 0 when $t'_{<i} \neq t_{<i}$. In other words, we interpolate the translation model with a function that keeps it close to the previous target, but stop applying the biasing once the new translation diverges from the previous one.
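As a rough illustration of the interpolation just described, the sketch below rescales a single next-token distribution. It is not taken from any MT toolkit: the function name, the dictionary representation of the distribution, and the choice to stop biasing once the previous translation has no token left at the current position are all assumptions made for the example.

```python
from typing import Dict, Sequence


def biased_distribution(model_probs: Dict[str, float],
                        hyp_prefix: Sequence[str],
                        prev_translation: Sequence[str],
                        beta: float) -> Dict[str, float]:
    """Interpolate the model's next-token distribution with the previous output.

    model_probs: next-token probabilities from the MT model for this step.
    hyp_prefix: tokens already chosen for the new hypothesis.
    prev_translation: translation of the previous (shorter) source prefix.
    """
    i = len(hyp_prefix)
    # The weighting is set to 0 once the new hypothesis has diverged.
    diverged = list(hyp_prefix) != list(prev_translation[:i])
    # Assumption: no bias when the previous translation has no token at position i.
    if diverged or i >= len(prev_translation):
        return dict(model_probs)
    target = prev_translation[i]
    biased = {tok: (1.0 - beta) * p for tok, p in model_probs.items()}
    # Add the biasing mass to the token that the previous translation used here.
    biased[target] = biased.get(target, 0.0) + beta
    return biased
```

In a real decoder this change has to be made inside the beam search itself, which is why the text notes that the technique requires a modified inference algorithm.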
As we noted earlier, biased beam search can degrade the quality of the translation, and we show experiments to illustrate this in Section 6. We also note that biased beam search assumes that the ASR simply extends its output each time it updates, when in fact ASR systems may rewrite their output. Furthermore, biased beam search requires modification of the underlying inference algorithm (which in the case of Marian is written in highly optimised, hand-crafted GPU code), removing one of the advantages of the retranslation approach (that it can be easily applied to standard MT systems).
4 Dynamic Masking
In this section we introduce our improvement to the mask-$k$ approach, which uses a variable mask that is set at runtime. The problem with using a fixed mask is that there are many time-steps where the system is unlikely to introduce radical changes to the translation as more source is revealed, and on these occasions we would like to use a small mask to reduce latency. However, the one-size-fits-all mask-$k$ strategy does not allow this variability.
The main idea of dynamic masking is to predict what the next source word will be, and check what effect this would have on the translation. If this changes the translation, then we mask; if not, we output the full translation.
More formally, we suppose that we have a source prefix $S = s_1 \ldots s_p$, a source-to-target translation system, and a function $\mathrm{pred}_k$, which can predict the next $k$ tokens following $S$. We translate $S$ using the translation system to give a translation hypothesis $T = t_1 \ldots t_q$. We then use $\mathrm{pred}_k$ to predict the tokens following $s_p$ in the source sentence to give an extended source prefix $S' = s_1 \ldots s_p s_{p+1} \ldots s_{p+k}$, and translate this to give another translation hypothesis $T'$. Comparing $T$ and $T'$, we select the longest common prefix $T^*$, and output this as the translation, thus masking the final $|T| - |T^*|$ tokens of the translation. If $S$ is a complete sentence, then we do not mask any of the output, as in the mask-$k$ strategy. The overall procedure is illustrated in Figure 2.
In fact, after initial experiments, we found it was more effective to refine our strategy, and not mask at all if the translation after dynamic masking is a prefix of the previous translation. In this case we directly output the last translation. In other words, we do not mask if $T^*_i$ is a prefix of $T^*_{i-1}$, but instead output $T^*_{i-1}$ again, where $T^*_i$ denotes the masked translation for the $i$th ASR input. We also note that this refinement does not give any benefit to the mask-$k$ strategy in our experiments. The reason that this refinement is effective is that the translation of the extended prefix can sometimes exhibit instabilities early in the sentence (which then disappear in a subsequent prefix). Applying a large mask in response to such instabilities increases latency, so we effectively “freeze” the output of the translation system until the instability is resolved.
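The procedure above, including the “freeze” refinement, can be sketched as follows. This is a minimal sketch under stated assumptions rather than the authors' code: `translate` and `predict_extension` are hypothetical callables standing in for the MT system and the source predictor, translations are token lists, and only a single predicted extension is probed.

```python
from typing import Callable, List, Sequence

Tokens = List[str]


def longest_common_prefix(a: Sequence[str], b: Sequence[str]) -> Tokens:
    """Return the longest shared prefix of two token sequences."""
    out: Tokens = []
    for x, y in zip(a, b):
        if x != y:
            break
        out.append(x)
    return out


def is_prefix(a: Sequence[str], b: Sequence[str]) -> bool:
    """True if token sequence a is a prefix of token sequence b."""
    return len(a) <= len(b) and list(b[:len(a)]) == list(a)


def dynamic_mask(source: Tokens,
                 translate: Callable[[Tokens], Tokens],
                 predict_extension: Callable[[Tokens], Tokens],
                 previous_output: Tokens,
                 sentence_complete: bool) -> Tokens:
    """Return the translation to display for the current source prefix."""
    hyp = translate(source)
    if sentence_complete:
        return hyp  # complete sentences are never masked
    # Probe: translate the source prefix with predicted tokens appended.
    hyp_ext = translate(list(source) + list(predict_extension(source)))
    stable = longest_common_prefix(hyp, hyp_ext)
    # Refinement: keep the previous output if the stable part adds nothing to it.
    if is_prefix(stable, previous_output):
        return list(previous_output)
    return stable
```

Each update costs two translation calls in this sketch (one for the prefix and one for the probe); probing several extensions would add one call per extension.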
To predict the source extensions (i.e. to define the function $\mathrm{pred}_k$ above), we first tried using a language model trained on the source text. This worked well, but we decided to add two simple strategies in order to see how important it was to have good prediction. The extension strategies we include are:
- lm-sample: We sample the next token from a language model (LSTM) trained on the source side of the parallel training data. We can choose $n$ possible extensions by choosing $n$ distinct samples.
- lm-greedy: This also uses an LM, but chooses the most probable token at each step.
- unknown: We extend the source sentence using the unk token from the vocabulary.
- random: We extend by sampling randomly from the vocabulary, under a uniform distribution. As with lm-sample, we can generalise this strategy by choosing $n$ different samples.
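The four extension strategies can be sketched as small Python functions. Everything here is illustrative: the function names, the `<unk>` string, and the `next_token_probs` callable (a stand-in for the LSTM language model, returning a token-to-probability dictionary for a given context) are assumptions, and the sampling variants do not enforce that the samples are distinct.

```python
import random
from typing import Callable, Dict, List, Sequence

NextTokenProbs = Callable[[Sequence[str]], Dict[str, float]]


def extend_unknown(prefix: Sequence[str], k: int, unk: str = "<unk>") -> List[List[str]]:
    """Append k copies of the unknown-word token (assumed spelling '<unk>')."""
    return [list(prefix) + [unk] * k]


def extend_random(prefix: Sequence[str], vocab: Sequence[str], k: int, n: int,
                  rng: random.Random) -> List[List[str]]:
    """Build n extensions of k tokens drawn uniformly from the vocabulary."""
    return [list(prefix) + [rng.choice(vocab) for _ in range(k)] for _ in range(n)]


def extend_lm_greedy(prefix: Sequence[str], next_token_probs: NextTokenProbs,
                     k: int) -> List[List[str]]:
    """Append the most probable next token k times."""
    ext = list(prefix)
    for _ in range(k):
        probs = next_token_probs(ext)
        ext.append(max(probs, key=probs.get))
    return [ext]


def extend_lm_sample(prefix: Sequence[str], next_token_probs: NextTokenProbs,
                     k: int, n: int, rng: random.Random) -> List[List[str]]:
    """Build n extensions by sampling k tokens each from the language model."""
    extensions = []
    for _ in range(n):
        ext = list(prefix)
        for _ in range(k):
            probs = next_token_probs(ext)
            tokens = list(probs)
            weights = [probs[t] for t in tokens]
            ext.append(rng.choices(tokens, weights=weights)[0])
        extensions.append(ext)
    return extensions
```

Each function returns a list of extended source prefixes, so the single-extension and multi-extension strategies can be handled uniformly by the caller.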
5 Evaluation of Retranslation Strategies
Different approaches have been proposed to evaluate online SLT, so we explain and justify our approach here. We follow previous work on retranslation (Niehues et al., 2018; Arivazhagan et al., 2020, 2020) and consider that the performance of online SLT should be assessed according to three different aspects – quality, latency and flicker. All of these aspects are important to users of online SLT, but improving on one can have an adverse effect on other aspects. For example outputting translations as early as possible will reduce latency, but if these early outputs are incorrect then either they can be corrected (increasing flicker) or retained as part of the later translation (reducing quality). In this section we will define precisely how we measure these system aspects. We assume that selecting the optimal trade-off between quality, latency and flicker is a question for the system deployer, that can only be settled by user testing.
The latency of the MT system should provide a measure of the time between the MT system receiving input from the ASR, and it producing output that can potentially be sent to the user. A standard (text-to-text) MT system would have to wait until it has received a full sentence before it produces any output, and so exhibits high latency.
We follow Ma et al. (2019) by using a latency metric called average lag (AL), which measures the degree to which the output lags behind the input. This is done by averaging the difference between the number of words the system has output, and the number of words expected, given the length of the source prefix received, and the ratio between source and target length. Formally, AL for source and target sentences $S$ and $T$ is defined as:

$$\mathrm{AL}(S, T) = \frac{1}{\tau} \sum_{t=1}^{\tau} \left[ g(t) - \frac{t-1}{|T|/|S|} \right]$$

where $\tau$ is the number of target words generated by the time the whole source sentence is received, and $g(t)$ is the number of source words processed when the $t$th target word is produced. In our implementation, we calculate the AL at token (not subword) level with the standard tokenizer in sacreBLEU (Post, 2018), meaning that for Chinese output we calculate AL on characters.
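A small Python sketch may help make the AL definition concrete. It assumes that the schedule is already available as a list `g`, where `g[t-1]` is the number of source tokens read when the $t$th target token was produced, and it takes the cut-off as the index of the first target token produced once the whole source has been read (following Ma et al., 2019). The function name and the argument layout are hypothetical.

```python
from typing import Sequence


def average_lag(g: Sequence[int], source_len: int, target_len: int) -> float:
    """Average lag for one sentence, given the read schedule g."""
    if not g or source_len <= 0 or target_len <= 0:
        raise ValueError("need a non-empty schedule and positive lengths")
    ratio = target_len / source_len  # target-to-source length ratio |T|/|S|
    # First target position (1-based) at which the full source had been read.
    tau = next((t for t, read in enumerate(g, start=1) if read >= source_len), len(g))
    return sum(g[t - 1] - (t - 1) / ratio for t in range(1, tau + 1)) / tau
```

For instance, a system that waits for the whole four-word source before emitting anything has `g = [4, 4, 4, 4]` and an AL of 4, while a system that emits one target word per source word has `g = [1, 2, 3, 4]` and an AL of 1.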
This metric differs from the one used in Arivazhagan et al. (2020), where latency is defined as the mean time between a source word being received and the translation of that source word being finalised. (In the presentation of that paper at ICASSP, the authors used a latency metric similar to the one used here, and different from the one they used in the paper.) However, this definition conflates latency and flicker, since outputting a translation and then updating it is penalised for both aspects. The update is penalised for flicker since the translation is updated (see below) and it is penalised for latency, since the timestamp of the initial output is ignored in the latency calculation.
The idea of flicker is to obtain a measure of the potentially distracting changes that are made to the MT output, as its ASR-supplied source sentence is extended. We assume that straightforward extensions of the MT output are fine, but changes which require rewriting part of the MT output should result in a higher (i.e. worse) flicker score. Following Arivazhagan et al. (2020), we measure flicker using the normalised erasure (NE), which is defined as the minimum number of tokens that must be erased from each translation hypothesis when outputting the subsequent hypothesis, normalised across the sentence. As with AL, we calculate the NE at token level for German, and at character level for Chinese.
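The erasure count can be illustrated with a short Python sketch. It assumes the successive displayed outputs for one sentence are available as token lists, and it normalises the total erasure by the length of the final output; that choice of normaliser is an assumption made for the example, and the function names are hypothetical.

```python
from typing import List, Sequence


def common_prefix_len(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest shared prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def normalised_erasure(outputs: List[List[str]]) -> float:
    """Total tokens erased between successive outputs, per final output token."""
    if not outputs or not outputs[-1]:
        raise ValueError("need at least one output and a non-empty final output")
    erased = 0
    for prev, cur in zip(outputs, outputs[1:]):
        # Tokens of the previous output that are not kept as a prefix of the new one.
        erased += len(prev) - common_prefix_len(prev, cur)
    return erased / len(outputs[-1])
```

Outputs that only ever grow by appending tokens give a score of 0, whereas rewriting an early token counts every token from that position onwards as erased.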
As in previous work (Arivazhagan et al., 2020), quality is assessed by comparing the full sentence output of the system against a reference, using a sentence similarity measure such as BLEU. We do not evaluate quality on prefixes, mainly because of the need for a heuristic to determine partial references. Further, quality evaluation on prefixes would be conflated with the evaluation of latency and flicker, and thus we simply assume that if the partial sentences are of poor quality, this will be reflected in the other two metrics (flicker and latency). Note that our proposed dynamic mask strategy is only concerned with improving the flicker–latency trade-off curve and has no effect on full-sentence quality, so MT quality measurement is not the focus of this paper. Where we do require a measure of quality (in the assessment of biased beam search, which does change the full-sentence translation) we use BLEU as implemented by sacreBLEU (Post, 2018).
6 Experiments
6.1 Biased Beam Search and Mask-$k$
We first assess the effectiveness of biased beam search and mask-$k$ (with a fixed $k$), providing a more complete experimental picture than in Arivazhagan et al. (2020), and demonstrating the adverse effect of biased beam search on quality. For these experiments we use data released for the IWSLT MT task (Cettolo et al., 2017), in both English→German and English→Chinese. We consider a simulated ASR system, which supplies the gold transcripts to the MT system one token at a time. (A real online ASR system typically increments its hypothesis by adding a variable number of tokens in each increment, and may revise its hypothesis. Also, ASR does not normally supply sentence boundaries or punctuation, and these must be added by an intermediate component. Sentence boundaries may change as the ASR hypothesis changes. In this work we make simplifying assumptions about the nature of the ASR, in order to focus on retranslation strategies, leaving the question of dealing with real online ASR to future work.)
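The simulated ASR setting can be pictured with the following sketch, which feeds a gold transcript to a translation function one token at a time and retranslates each prefix from scratch. The names `simulated_asr`, `retranslate` and `translate` are hypothetical, and the transcript is assumed to be a single, already tokenised sentence.

```python
from typing import Callable, Iterator, List, Tuple


def simulated_asr(transcript: List[str]) -> Iterator[Tuple[List[str], bool]]:
    """Yield growing prefixes of a gold transcript, one extra token per update.

    The second element of each pair is True only for the complete sentence.
    """
    for i in range(1, len(transcript) + 1):
        yield transcript[:i], i == len(transcript)


def retranslate(transcript: List[str],
                translate: Callable[[List[str]], List[str]]) -> List[List[str]]:
    """Translate every prefix independently, as in the retranslation approach."""
    return [translate(prefix) for prefix, _ in simulated_asr(transcript)]
```

The list returned by `retranslate` is the raw sequence of hypotheses; a masking strategy would be applied to each hypothesis before it is displayed.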
For training we use the TED talk data, with dev2010 as heldout and tst2010 as test set. The raw data set sizes are 206112 sentences (en-de) and 231266 sentences (en-zh). We preprocess using the Moses (Koehn et al., 2007) tokenizer and truecaser (for English and German) and jieba (https://github.com/fxsjy/jieba) for Chinese. We apply BPE (Sennrich et al., 2016) jointly with 90k merge operations. For our MT system, we use the transformer-base architecture (Vaswani et al., 2017) as implemented by Nematus (Sennrich et al., 2017). We use 256 sentence mini-batches, and a 4000 iteration warm-up in training.
As we mentioned in the introduction, we did experiment with prefix training (using both alignment-based and length-based truncation) and found that this improved the translation of prefixes, but generally degraded translation for full sentences. Since prefix translation can also be improved using the masking and biasing techniques, and the former does not degrade full sentence translation, we only include experimental results when training on full sentences.
In Figure 3 we show the effect of varying $\beta$ and $k$ on our three evaluation measures, for English→German.
Looking at Figure 3(a) we notice that biased beam search has a strong impact in reducing flicker (erasure) at all values of $\beta$. However, the problem with this approach is clear in Figure 3(b), where we can see the reduction in BLEU caused by this biasing. This can be offset by increasing masking, as also noted by Arivazhagan et al. (2020), but as we show in Figure 3(c) this comes at the cost of an increase in latency.
Our experiments with en→zh show a roughly similar pattern, as shown in Figure 4. We find that lower levels of masking are required to reduce the detrimental effect of the biasing on BLEU, but latency increases more rapidly with masking.
6.2 Dynamic Masking
We now turn our attention to the dynamic masking technique introduced in Section 4. We use the same data sets and MT systems as in the previous section. To train the LM, we use the source side of the parallel training data, and train an LSTM-based LM.
To assess the performance of dynamic masking, we measure latency and flicker as we vary the length of the source extension ($k$) and the number of source extensions ($n$). We consider the 4 different extension strategies described at the end of Section 4. We do not show translation quality since it is unaffected by the dynamic masking strategy. The results for both en→de and en→zh are shown in Figure 5, where we compare to the strategy of using a fixed mask-$k$. The oracle data point is where we use the full-sentence translations to set the mask so as to completely avoid flicker.
We observe from Figure 5 that our dynamic mask mechanism improves over the fixed mask in all cases, by reducing both latency and flicker. Varying the source prediction strategy and parameters appears to preserve the same inverse relation between latency and flicker, although offering a different trade-off. Using several random source predictions (the green curve in both plots) offers the lowest flicker, at the expense of high latency, possibly because the prefix extension translations show a lot of variability. Using the LM for source prediction tends to have the opposite effect, favouring a reduction in latency. The pattern across the two language pairs is similar, although we observe a more dispersed picture for the en→zh results.
In order to provide further experimental verification of our retranslation model, we apply the strategy to a larger-scale English→Chinese system. Specifically, we use a model that was trained on the entire parallel training data for the WMT20 en-zh task (www.statmt.org/wmt20/translation-task.html), in addition to the TED corpus used above. For the larger model we preprocessed the data as before, except with 30k BPE merges applied separately to each language, and then trained a transformer-regular model using Marian. By doing this, we show that our strategy can easily be slotted into a different model from a different translation toolkit. The results are shown in Figure 6. We can verify that the pattern is unchanged in this setting.
To further explore how this dynamic mask strategy improves stability, we look at the English→German corpus and give several examples in Table 1. Here we do not compare with gold translations because we want to focus on how dynamic masking reduces the flicker caused by the MT system, rather than the overall quality of the translation (which is unaffected by dynamic masking). Note that in the second example, the longest common prefix between MT (Extension) and MT (i.e. the empty string) is a prefix of Previous Output, thus we simply take the previous translation as the output for the current source, as described in Section 4.
We can see that in both examples, dynamic masking gives more stable translations. Although the fixed mask-$k$ strategy can also avoid flicker, it would require a very large global value of $k$ to avoid flicker in the second example, and so would result in unnecessary latency in the first example. Notably, the translations for these two examples have similar lengths, which indicates that we cannot relax the global mask-$k$ strategy using sentence length ratios to handle both examples perfectly. Thus, our proposed dynamic mask strategy is more flexible and accurate than the mask-$k$ strategy used in Arivazhagan et al. (2020).
Table 1, first example:
|Source|and I wonder what you’d choose , because I’ve been asking my friends|
|Extension (Pred)|and I wonder what you’d choose , because I’ve been asking my friends to find my own single|
|MT|Und ich frage mich, was Sie wählen würden, denn ich habe meine Freunde gefragt .|
|MT (Extension)|Und ich frage mich, was Sie wählen würden, denn ich habe meine Freunde gebeten, meine eigenen Einzelheiten zu finden.|
|MT (masked)|Und ich frage mich, was Sie wählen würden, denn ich habe meine Freunde|
|MT (stable)|Und ich frage mich, was Sie wählen würden, denn ich habe meine Freunde diese Frage oft gestellt und sie wollen alle zurück gehen.|
|Previous Output|Und ich frage mich, was Sie wählen würden, denn ich habe|

Table 1, second example:
|Source|and , in fact , these kids don’t , so they’re going out and reading their school work|
|Extension (Pred)|and , in fact , these kids don’t , so they’re going out and reading their school work under them .|
|MT|Tatsächlich tun diese Kinder das nicht, also gehen sie raus und lesen ihre Schularbeit.|
|MT (Extension)|Und tatsächlich tun diese Kinder das nicht, also gehen sie raus und lesen ihre Schularbeit unter ihnen.|
|MT (masked)|Und tatsächlich tun diese Kinder das nicht, also gehen sie raus und lesen ihre|
|MT (stable)|Und tatsächlich tun diese Kinder das nicht, also gehen sie raus und lesen ihre Schularbeit unter den Straßenlampen.|
|Previous Output|Und tatsächlich tun diese Kinder das nicht, also gehen sie raus und lesen ihre|
7 Conclusion
We propose a dynamic mask strategy to improve the stability of the retranslation method in online SLT. We have shown that combining biased beam search with mask-$k$ works well in retranslation systems, but biased beam search requires a modified inference algorithm and hurts quality, and the additional mask-$k$ used to reduce this effect gives high latency. Instead, the dynamic mask strategy maintains the translation quality but gives a much better latency–flicker trade-off curve than mask-$k$. Our experiments also show that the effect of this strategy depends on both the length and number of predicted extensions, but the quality of predicted extensions is less important.
For future research, we would like to combine biased beam search with dynamic masking to see if it can give a better trade-off between quality and latency than Arivazhagan et al. (2020) when the flicker is small enough. We would also like to test our dynamic masking strategy on prefixes provided by a real ASR system.
This work has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under Grant Agreement No 825460 (Elitr).
References
- Arivazhagan et al. (2020). Re-Translation Strategies For Long Form, Simultaneous, Spoken Language Translation. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- Arivazhagan et al. (2019). Monotonic Infinite Lookback Attention for Simultaneous Machine Translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 1313–1323.
- Arivazhagan et al. (2020). Re-translation versus Streaming for Simultaneous Translation. arXiv e-prints, arXiv:2004.03643.
- Bangalore et al. (2012). Real-time incremental speech-to-speech translation of dialogs. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Montréal, Canada, pp. 437–445.
- Cettolo et al. (2017). Overview of the IWSLT 2017 Evaluation Campaign. In Proceedings of IWSLT.
- Cho and Esipova (2016). Can neural machine translation do simultaneous translation?. CoRR abs/1606.02012.
- Gu et al. (2017). Learning to translate in real-time with neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Valencia, Spain, pp. 1053–1062.
- Junczys-Dowmunt et al. (2018). Marian: fast neural machine translation in C++. In Proceedings of ACL 2018, System Demonstrations, Melbourne, Australia, pp. 116–121.
- Koehn et al. (2007). Moses: open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, Prague, Czech Republic, pp. 177–180.
- Ma et al. (2019). STACL: simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3025–3036.
- Niehues et al. (2018). Low-latency neural speech translation. In Proceedings of Interspeech.
- Post (2018). A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, Brussels, Belgium, pp. 186–191.
- Rangarajan Sridhar et al. (2013). Segmentation strategies for streaming speech translation. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, Georgia, pp. 230–238.
- Sennrich et al. (2017). Nematus: a toolkit for neural machine translation. In Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Valencia, Spain, pp. 65–68.
- Sennrich et al. (2016). Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1715–1725.
- Vaswani et al. (2017). Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008.
- Zheng et al. (2019). Simultaneous translation with flexible policy via restricted imitation learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 5816–5822.
- Zheng et al. (2019). Simpler and faster learning of adaptive policies for simultaneous translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 1349–1354.