Simultaneous Neural Machine Translation (SNMT) addresses the problem of real-time interpretation in machine translation. To achieve live translation, an SNMT model alternates between reading the source sequence and writing the target sequence using either a fixed or an adaptive policy. Streaming SNMT models can only append tokens to a partial translation as more source tokens become available, with no possibility of revising the existing partial translation. A typical application is conversational speech translation, where target tokens must be appended to the existing output.
For certain applications, such as live captioning on videos, never revising the existing translation is overly restrictive. If we are allowed to revise the previous (partial) translation, simply re-translating each successive source prefix becomes a viable strategy. Since a re-translation strategy is not constrained to preserve the previous output, it yields high translation quality. Current re-translation approaches are based on autoregressive sequence generation models (ReTA), which generate the target tokens of each (partial) translation sequentially. As the source sequence grows, repeated re-translation with sequential generation leads to an increasing inference-time gap, causing the translation to fall out of sync with the input stream. Moreover, owing to the large number of inference operations involved, ReTA models are not well suited to resource-constrained devices.
In this work, we build a re-translation based simultaneous translation system using non-autoregressive sequence generation models to reduce the computation cost during inference. The proposed system generates the target tokens in parallel whenever new source information arrives; hence, it reduces the number of inference operations required to generate the final translation. To assess the effectiveness of the proposed approach, we implement the re-translation based autoregressive (ReTA) Arivazhagan et al. (2019a) and Wait-k Ma et al. (2019) models along with our proposed system. Our experimental results reveal that the proposed model achieves significant gains over the ReTA and Wait-k models in terms of computation time while maintaining the superior translation quality of re-translation over the streaming based approaches.
Revising the existing output can cause textual instability in re-translation based approaches. Previous work proposed a stability metric, Normalized Erasure (NE) Arivazhagan et al. (2019a), to capture this instability. However, NE only considers the first point of difference between a pair of translations and fails to quantify the textual instability experienced by the user. In this work, we propose a new stability metric, Normalized Click-N-Edit (NCNE), which better quantifies textual instability by counting the insertions/deletions/replacements between a pair of translations.
The main contributions of our work are as follows:
We propose a re-translation based simultaneous translation system that reduces the high inference time of current re-translation approaches.
We propose a new stability metric, Normalized Click-N-Edit, which is more sensitive to flickers in the translation than the existing stability metric, Normalized Erasure.
We conduct several experiments on simultaneous text-to-text translation tasks and establish the efficacy of the proposed approach.
2 Faster Re-translation Model
We briefly describe the simultaneous translation setting to define the problem and set up notation. The source and target sequences are represented as $\mathbf{x} = (x_1, \dots, x_{T_x})$ and $\mathbf{y} = (y_1, \dots, y_{T_y})$, with $T_x$ and $T_y$ being the lengths of the source and target sequences. Unlike offline neural machine translation (NMT) models, simultaneous neural machine translation (SNMT) models produce the target sequence concurrently with the growing source sequence. In other words, the probability of predicting the target token $y_t$ at time $t$ depends only on the partial source sequence $x_{\le g(t)}$. The probability of predicting the entire target sequence is given by:

$$p(\mathbf{y} \mid \mathbf{x}) = \prod_{t=1}^{T_y} p\big(y_t \mid \mathrm{dec}(y_{<t}),\ \mathrm{enc}(x_{\le g(t)})\big), \quad (1)$$

where $\mathrm{enc}$ and $\mathrm{dec}$ are the encoder and decoder layers of the SNMT model which produce the hidden states for the source and target sequences. Here $g(t)$ denotes the number of source tokens processed by the encoder when predicting the token $y_t$; it is bounded by $0 \le g(t) \le T_x$. For a given dataset $\mathcal{D}$, the training objective is:

$$\mathcal{L}(\theta) = -\sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}} \log p(\mathbf{y} \mid \mathbf{x}). \quad (2)$$
In a streaming based SNMT model Ma et al. (2019), whenever new source information arrives, a new target token is generated by computing $p(y_t \mid y_{<t}, x_{\le g(t)})$ and is appended to the current partial translation. On the other hand, re-translation based SNMT models compute $p(\mathbf{y} \mid x_{\le g(t)})$ from scratch every time new source information arrives. Let us assume that the source tokens arrive with a stride $s$, i.e., $s$ input tokens arrive at once, and that $T_y \cdot c$ is the computational time required to generate the target sequence, where $c$ is the computational cost of the model for predicting one target token. The re-translation based system then takes $O\big(\tfrac{T_x}{s} \cdot T_y \cdot c\big)$ due to the repeated computation of the translation.
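As a rough illustration of these costs, the following sketch counts decoder invocations for the three strategies under the simplifying assumptions above (target length proportional to source length); the function names and toy values are ours, not part of the models:

```python
# Sketch: decoder-invocation counts for the three strategies, assuming a
# stride of s source tokens per step and target length ~ source length.

def streaming_ops(src_len: int, stride: int) -> int:
    # A streaming (e.g. Wait-k) model emits each token once:
    # roughly one decoder call per target token.
    return src_len  # assuming |y| ~ |x|

def reta_ops(src_len: int, stride: int) -> int:
    # Autoregressive re-translation decodes the growing target from
    # scratch at every stride: sum of partial target lengths.
    total = 0
    for seen in range(stride, src_len + 1, stride):
        total += seen  # partial target length ~ source tokens seen so far
    return total

def fretna_ops(src_len: int, stride: int, iterations: int = 2) -> int:
    # Non-autoregressive re-translation runs a fixed number of parallel
    # insertion-deletion iterations per stride, independent of target length.
    return (src_len // stride) * iterations

print(streaming_ops(30, 3), reta_ops(30, 3), fretna_ops(30, 3))  # 30 165 20
```

The quadratic growth of `reta_ops` with source length is exactly the inference-time gap discussed above.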
Table 1: NE and NCNE values for two candidate revisions of a previous partial translation.

| Translation | NE | NCNE |
|---|---|---|
| prev: I live South Korea and | - | - |
| current 1: I live in South Korea, and I am | 3 | 1 |
| current 2: I live in North Carolina and I am | 3 | 3 |
2.2 Faster Re-translation With NAT
To address the issue of high computation cost faced in the existing re-translation models, we design a new re-translation model based on the non-autoregressive translation (NAT) approach.
The encoder in our model is based on the Transformer encoder Vaswani et al. (2017) and the decoder is adopted from the Levenshtein Transformer (LevT) Gu et al. (2019b). We choose the LevT model as our decoder since it is a non-autoregressive neural language model and suits the re-translation objective of editing the partial translation. The overview of the proposed system, referred to as FReTNA, is illustrated with a German-English example in Figure 1. We describe the main components of LevT and the proposed changes that enable smoother re-translation in the following paragraphs.
The LevT model generates all the tokens in the translation in parallel and iteratively refines the translation using insertion/deletion operations. These operations are realized by three components in the Transformer decoder: a placeholder classifier, a token classifier, and a deletion classifier. Insertion is carried out by the placeholder and token classifiers: the placeholder classifier finds the positions at which to insert new tokens, and the token classifier fills these positions with actual tokens from the vocabulary. Deletion is performed by the deletion classifier. The inputs to these classifiers come from the Transformer encoder and decoder blocks and are computed as:
$$\mathbf{h}^{(0)}_i = E_{y_i} + P_i, \quad (3)$$

$$\mathbf{h}^{(l)}_i = \mathrm{dec}^{(l)}\big(\mathbf{h}^{(l-1)}_i,\ \mathrm{enc}(\mathbf{x})\big), \quad (4)$$

where $E$ and $P$ are the word and position embeddings of a token. The decoder outputs $\mathbf{h}_i$ from the last layer are then passed to the three classifiers to edit the previous translation by performing insertion/deletion operations. These operations are repeated whenever new source information arrives.
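The insertion/deletion cycle can be illustrated with a toy sketch; the three "classifier" functions below are hand-written stand-ins for the learned components, and the token values are invented for illustration:

```python
# Toy sketch of one LevT-style edit iteration on a token list: deletion,
# then placeholder insertion, then token filling.

PLH = "[PLH]"

def delete_step(tokens, keep_mask):
    # keep_mask[i] = 1 keeps token i, 0 deletes it (boundaries always kept).
    return [t for t, k in zip(tokens, keep_mask) if k == 1]

def placeholder_step(tokens, n_insert):
    # n_insert[i] placeholders are inserted after token i.
    padded = n_insert + [0] * (len(tokens) - len(n_insert))
    out = []
    for t, n in zip(tokens, padded):
        out.append(t)
        out.extend([PLH] * n)
    return out

def token_step(tokens, fillers):
    # Fill placeholders left-to-right with predicted tokens.
    fillers = iter(fillers)
    return [next(fillers) if t == PLH else t for t in tokens]

y = ["<s>", "India", "hopes", "</s>"]
y = delete_step(y, [1, 1, 1, 1])          # nothing deleted this round
y = placeholder_step(y, [0, 0, 1])        # one slot after "hopes"
y = token_step(y, ["reportedly"])
print(y)  # ['<s>', 'India', 'hopes', 'reportedly', '</s>']
```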
Placeholder classifier: It predicts the number of tokens to be inserted between every two consecutive tokens in the current partial translation. Compared to LevT's placeholder classifier, we incorporate a positional bias, given by the second term in Eq. 5. As the predicted sequence length grows, the bias becomes stronger and the model inserts fewer tokens at the start, reducing flicker. The placeholder classifier with positional bias is given by:
$$\pi^{\mathrm{plh}}(p = k \mid i) = \operatorname{softmax}\big(A\,[\mathbf{h}_i ; \mathbf{h}_{i+1}] - \lambda\,(n - i)\,\mathbf{k}\big), \quad (5)$$

where $A \in \mathbb{R}^{(K_{\max}+1) \times 2d}$, $\mathbf{k} = (0, 1, \dots, K_{\max})^{\top}$, and $0 \le k \le K_{\max}$. Based on the number of tokens predicted by Eq. 5, we insert that many placeholders ([PLH]) at the current position, and this is computed for all positions in the (partial) translation of length $n$. Here, $K_{\max}$ represents the maximum number of insertions between two tokens and $\lambda$ is a learnable parameter which balances the predictions based on the hidden states and the (partial) translation length.
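The effect of such a positional bias can be illustrated numerically. The exact functional form below is an assumption for illustration only: `lam` plays the role of the learnable weight, and the penalty grows with the insertion count, the hypothesis length, and proximity to the start of the hypothesis:

```python
# Sketch: penalize larger insertion counts more strongly near the start
# of a longer hypothesis, so the model prefers appending over inserting early.
import numpy as np

def biased_placeholder_probs(logits, i, n, lam=0.5):
    """logits: raw scores over insertion counts 0..K_max at slot i of an
    n-token hypothesis; returns the biased softmax distribution."""
    k = np.arange(len(logits), dtype=float)   # candidate insertion counts
    bias = lam * (n - i) * k                  # stronger for early slots
    z = logits - bias
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([0.0, 1.0, 1.0])  # toy scores for inserting 0, 1, or 2 tokens
early = biased_placeholder_probs(logits, i=0, n=10)
late = biased_placeholder_probs(logits, i=9, n=10)
print(early.argmax(), late.argmax())  # 0 1
```

At an early slot the bias suppresses insertions entirely, while near the end of the hypothesis the original scores dominate and one token is inserted.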
Token classifier: The token classifier is similar to LevT's token classifier; it fills in tokens for all the placeholders inserted by the placeholder classifier. This is achieved as follows:

$$\pi^{\mathrm{tok}}(t \mid i) = \operatorname{softmax}\big(C\,\mathbf{h}_i\big), \quad (6)$$

where $C \in \mathbb{R}^{|V| \times d}$ and [PLH] is the placeholder token.
Deletion classifier: It scans over the hidden states (except for the start and end tokens) and predicts whether to keep (1) or delete (0) each token in the (partial) translation. Similar to the placeholder classifier, we add a positional bias to the deletion classifier to discourage the deletion of the initial tokens of the translation as the source sequence grows. The deletion classifier with positional bias is given by:

$$\pi^{\mathrm{del}}(d \mid i) = \operatorname{softmax}\big(B\,\mathbf{h}_i - \gamma\,(n - i)\,\mathbf{e}_{\mathrm{del}}\big), \quad (7)$$

where $B \in \mathbb{R}^{2 \times d}$, $\mathbf{e}_{\mathrm{del}}$ is the indicator vector of the delete class, and we always keep the boundary tokens ($d_0 = d_n = 1$).
The model with these modified placeholder and deletion classifiers focuses more on appending to the partial translation whenever new source information comes in, which results in smoother translations with lower textual instability. Here, $\gamma$ is a learnable parameter.
The insertion and deletion operations are complementary; hence, we combine them in an alternating fashion. In each iteration, we first call the placeholder classifier, followed by the token classifier and the deletion classifier. We repeat this process until a stopping condition is met, i.e., the generated translation is the same in consecutive iterations, or MAX iterations are reached. In our experiments, we found that two iterations of insertion-deletion operations are sufficient when generating the partial translation for newly arrived source information. To produce a partial translation, the model incurs a cost of $O(c')$, where $c'$, the cost of the insertion-deletion operations, is comparable to $c$ since the decoding layers are similar. The overall time complexity of our model (FReTNA) is $O\big(\tfrac{T_x}{s} \cdot c'\big)$.¹ The FReTNA computational cost is thus roughly $T_y$ times less than that of the ReTA model.

¹ The time complexities provided for the ReTA and FReTNA models are for comparison and do not represent the actual computational costs, since all the target tokens are generated in parallel.
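The alternating insertion-deletion loop with its early-stopping condition can be sketched as follows; `edit_once` is a hypothetical stand-in for one placeholder, token, and deletion pass of the real model, and the dummy editor exists only to exercise the loop:

```python
# Sketch of the FReTNA-style outer loop: for each newly arrived stride of
# source tokens, run at most MAX_ITERATIONS edit passes, stopping early
# if the hypothesis no longer changes.

MAX_ITERATIONS = 2

def retranslate(source_tokens, stride, edit_once, hypothesis=None):
    hypothesis = hypothesis or []
    for pos in range(stride, len(source_tokens) + stride, stride):
        prefix = source_tokens[:pos]
        for _ in range(MAX_ITERATIONS):
            revised = edit_once(prefix, hypothesis)
            if revised == hypothesis:  # converged for this prefix
                break
            hypothesis = revised
        yield list(hypothesis)

# A dummy editor that just "translates" by uppercasing the prefix.
dummy = lambda prefix, hyp: [w.upper() for w in prefix]
steps = list(retranslate(["ich", "bin", "hier"], stride=1, edit_once=dummy))
print(steps[-1])  # ['ICH', 'BIN', 'HIER']
```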
We use imitation learning to train FReTNA, similar to the Levenshtein Transformer. Unlike Arivazhagan et al. (2020), which trains on prefix sequences along with the full-sentence corpus, we train the model on the full-sequence corpus only. The expert policy used for imitation learning is derived from a sequence-level knowledge distillation process Kim and Rush (2016). More precisely, we first train an autoregressive model using the same datasets and then replace the original target sequences with the beam-search outputs of this model. Please refer to Gu et al. (2019b) for more details on imitation learning for the LevT model.
At inference time, we apply the trained model greedily (beam size = 1) over the streaming input sequence. For every set of new source tokens, we apply the insertion and deletion policies and pick the actions with the highest probabilities in Eq. 5, 6, and 7. During re-translation based simultaneous translation, the partial translations are inherently revised when a new set of input tokens arrives; hence, we apply only two iterations of the insertion-deletion sequence to the current partial translation. We also impose a penalty on the current partial translation to encourage it to match the prefix of the previous translation, by subtracting a penalty from the logits in Eq. 5 and Eq. 7.
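A minimal sketch of such a prefix-match penalty is given below; the exact form used by the model is not spelled out above, so the per-position scheme and the penalty value here are assumptions for illustration:

```python
# Sketch: subtract a constant from the edit-action logits at positions
# lying inside the prefix shared with the previous translation, so the
# model is discouraged from revising already-shown tokens.

def apply_prefix_penalty(logits_per_pos, current, previous, penalty=2.0):
    """logits_per_pos[i] is the score of editing position i."""
    prefix_len = 0
    for a, b in zip(current, previous):
        if a != b:
            break
        prefix_len += 1
    return [
        score - penalty if i < prefix_len else score
        for i, score in enumerate(logits_per_pos)
    ]

scores = apply_prefix_penalty([1.0, 1.0, 1.0],
                              ["I", "live", "in"],
                              ["I", "live", "on"])
print(scores)  # [-1.0, -1.0, 1.0]
```

Only the third position, where the two translations already disagree, keeps its original edit score.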
2.5 Measuring Stability of Translation
One important property of re-translation based models is that they should produce the translation output with as few textual instabilities, or flickers, as possible; otherwise, the frequent changes in the output can be distracting to users. The ReTA model Arivazhagan et al. (2020) uses Normalized Erasure (NE) as a stability measure, following Niehues et al. (2016); Niehues et al. (2018a); Arivazhagan et al. (2020a): it measures the length of the suffix that must be deleted from the previous partial translation to produce the current translation. However, the metric does not account for the actual number of insertions/deletions/replacements, which provide a much better measure of visual instability. In Table 1, NE gives the same penalty to both current translations; however, current translation 2 would obviously cause more visual instability than current translation 1. In order to have a metric that better represents the flickers during re-translation, we propose a new stability measure, called Normalized Click-N-Edit (NCNE). NCNE is computed (Eq. 8) using the Levenshtein distance Levenshtein (1966), which counts the number of insertions/deletions/replacements to be performed on the current translation to match the previous translation. As shown in Table 1, NCNE gives a higher penalty to current translation 2 since it has a larger visual difference than current translation 1. The NCNE measure aligns better with the textual stability goal of re-translation based SNMT models. The metric is given as
$$\mathrm{NCNE}\big(\mathbf{y}^{\mathrm{cur}}, \mathbf{y}^{\mathrm{prev}}\big) = \frac{\mathrm{LevDist}\big(\mathbf{y}^{\mathrm{cur}}, \mathbf{y}^{\mathrm{prev}}\big)}{|\mathbf{y}^{\mathrm{cur}}|}, \quad (8)$$

where $\mathbf{y}^{\mathrm{cur}}$ and $\mathbf{y}^{\mathrm{prev}}$ represent the current and previous translations.
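A token-level NCNE can be sketched as follows; the normalization by the current translation's length is our assumption, and the example tokens are illustrative:

```python
# Sketch: NCNE as token-level Levenshtein distance between consecutive
# partial translations, normalized by the current translation's length.

def levenshtein(a, b):
    # Classic single-row dynamic program over token lists.
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,        # delete x
                dp[j - 1] + 1,    # insert y
                prev + (x != y),  # replace (or match)
            )
    return dp[len(b)]

def ncne(current, previous):
    return levenshtein(current, previous) / max(len(current), 1)

prev = "I live South Korea and".split()
cur = "I live in South Korea and".split()
print(levenshtein(cur, prev))  # 1: a single insertion of "in"
```

Unlike a suffix-erasure measure, the edit distance charges each changed token individually, so a revision that rewrites words mid-sentence scores worse than one that merely inserts a word.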
We use three diverse MT language pairs to evaluate the proposed model: WMT'15 German-English (DeEn), IWSLT 2020 English-German (EnDe), and WMT'14 English-French (EnFr).
DeEn translation task:
We use WMT15 German-to-English (4.5 million examples) as the training set and newstest2013 as the dev set. All results are reported on newstest2015.
EnDe translation task:
For this task, we use the dataset composition given in IWSLT 2020. The training corpus consists of MuST-C, OpenSubtitles2018, and WMT19, with a total of 61 million examples. We choose the best system based on the MuST-C dev set and report the results on the MuST-C tst-COMMON test set. The WMT19 dataset further consists of Europarl v9, ParaCrawl v3, Common Crawl, News Commentary v14, Wiki Titles v1 and Document-split Rapid for the German-English language pair. Due to the presence of noise in the OpenSubtitles2018 and ParaCrawl, we only use 10 million randomly sampled examples from these corpora.
EnFr translation task:
We use WMT14 EnFr (36.3 million examples) as the training set, newstest2012 + newstest2013 as the dev set, and newstest2014 as the test set.
More details about the data statistics can be found in the Appendix.
We adopt an evaluation framework similar to Arivazhagan et al. (2019a), which includes metrics for quality and latency. Translation quality is measured by calculating the de-tokenized BLEU score using the sacrebleu script Post (2018).
Most latency metrics for simultaneous translation are based on a delay vector, which records how many source tokens were read before outputting each target token. To address the re-translation scenario, where target content can change, we use content delay, similar to Arivazhagan et al. (2019a). Content delay measures the delay with respect to when the token at a particular position is finalized. For example, in Figure 1, a token appears as "may" at step 4; however, it is finalized as "could" at step 5. The delay vector is modified based on this content delay and used in Average Lagging (AL) Ma et al. (2019) to compute the latency.
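Average Lagging over a (content-)delay vector can be sketched as below, following the AL definition of Ma et al. (2019); the toy delay vector is ours:

```python
# Sketch: Average Lagging (AL) from a delay vector. For re-translation,
# delays[t-1] would be the content delay: the number of source tokens
# read when the token at target position t was last changed (finalized).

def average_lagging(delays, src_len):
    gamma = len(delays) / src_len  # target/source length ratio
    # tau: first target position whose delay covers the full source.
    tau = next((t for t, d in enumerate(delays, 1) if d >= src_len),
               len(delays))
    return sum(delays[t - 1] - (t - 1) / gamma for t in range(1, tau + 1)) / tau

# Wait-2-like schedule on a 4-token source with a 4-token target:
print(average_lagging([2, 3, 4, 4], src_len=4))  # 2.0
```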
3.3 Implementation Details
The proposed FReTNA, ReTA Arivazhagan et al. (2020b), and Wait-k Ma et al. (2019) models are implemented using the Fairseq framework Ott et al. (2019). All the models use the Transformer as the base architecture, with settings similar to Arivazhagan et al. (2020). The text sequences are processed using a subword vocabulary Sennrich et al. (2016). All the models are trained on 4 NVIDIA P40 GPUs for 300K steps with a batch size of 4096 tokens. ReTA is trained using prefix-augmented training data, and FReTNA uses a distilled training dataset as described in Kim and Rush (2016). The setting of the hyperparameter described in Section 2.4, along with further implementation and hyperparameter details, is given in the Appendix.
In this section, we report the results of our experiments on the DeEn, EnDe, and EnFr language pairs. To test our FReTNA system, we compare it with recent approaches in re-translation (ReTA, Arivazhagan et al. (2019a)) and streaming based systems (Wait-k, Ma et al. (2019)). Unlike traditional translation systems, where the aim is to achieve a higher BLEU score, simultaneous translation focuses on balancing the quality-latency and time-latency trade-offs. Thus, we compare all three approaches on these two trade-offs: (1) quality vs. latency and (2) inference time vs. latency. The latency is measured by AL. The inference time is the amount of time taken to compute the output (normalized per sentence).
Input Sequence #1: Berichten zufolge hofft Indien darüber hinaus auf einen Vertrag zur Verteidigungszusammenarbeit zwischen den beiden Nationen.

| Step | ReTA Model | FReTNA Model |
|---|---|---|
| 2 | India is | India reportedly hopes |
| 3 | India is also | India is hopes reportedly to hoping for one . |
| 4 | India is also re | India is hopes reportedly to hoping for one defense treaty |
| 5 | India is also reporte | India is reportedly reportedly hoping for one defense treaty between the two nations . |
| 6 | India is also reportedly | India is also reportedly hoping for a defense treaty between the two nations . |
| 7 | India is also reportedly hoping | - |
| 8 | India is also reportedly hoping for | - |
| 9 | India is also reportedly hoping for a | - |
| 10 | India is also reportedly hoping for a treaty | - |
| 11 | India is also reportedly hoping for a treaty on | - |
| 12 | India is also reportedly hoping for a treaty on defense | - |
| 13 | India is also reportedly hoping for a treaty on defense cooperation | - |
| 14 | India is also reportedly hoping for a treaty on defense cooperation between the two nations. | - |
Quality v/s Latency:
We report results with both the NE and NCNE stability metrics (Section 2.5). The re-translation models have similar results under NCNE and NE. However, with NE, the models have slightly inferior results, since it imposes a stricter constraint for stability.
The proposed FReTNA model performs slightly worse than the Wait-k model in the low latency range and better in the medium to high latency range for the DeEn and EnFr language pairs. For EnDe, our model performs better than Wait-k in all latency ranges.
The slightly inferior performance of FReTNA relative to ReTA is attributed to the difficulty of anticipating multiple target tokens simultaneously with limited source context. However, FReTNA slightly outperforms ReTA in the medium to high latency ranges for the EnDe language pair.
Inference Time v/s Latency:
Figure 4 shows the inference time vs. latency plots for the DeEn, EnDe, and EnFr language pairs. Since our model generates all target tokens simultaneously, it has a much lower inference time than the ReTA and Wait-k models. Generally, streaming based simultaneous translation models such as Wait-k have lower inference time than re-translation based approaches such as ReTA, since the former append to the (partial) translation whereas the latter sequentially regenerate the (partial) translation from scratch for every newly arrived piece of source information. Even though our FReTNA model is based on re-translation, it has lower inference time than both Wait-k and ReTA, since we adopt a non-autoregressive model that generates all the tokens of the (partial) translation in parallel.
For comparison, we also trained offline ReTA and FReTNA models for the three language pairs; the results are reported in Table 2. The BLEU scores of the simultaneous and offline variants of ReTA and FReTNA are comparable. Thus, we conclude that our proposed FReTNA approach is better than ReTA and Wait-k in terms of inference time in all latency ranges, while maintaining the superior translation quality of re-translation over the streaming based approaches.
Impact of positional bias:
We evaluate the FReTNA model with and without the positional bias introduced in Eq. 5 and 7 (FReTNA_pos vs. FReTNA_non_pos) to see whether positional bias helps the model generate smoother translations. As shown in Figure 5, FReTNA_non_pos has more flickers than the FReTNA_pos model, as it is not able to meet the NCNE cutoff of 0.2 in the low latency range. The lower performance of FReTNA_non_pos in the low latency range is due to predicting more tokens than required (insertion policy) when little source information is available. Later, when more source information becomes available, some of those tokens have to be deleted (deletion policy), causing more flickers in the final translation output. From Figure 5, we can see that positional bias reduces flickers in the translation and is especially useful in the low latency range.
Sample Translation Process:
In Table 3, we compare the process of generating the target sequence with the ReTA and FReTNA models. The examples are collected by running inference with the two models on the DeEn test set. ReTA generates the target tokens from scratch at every step in an autoregressive manner, which leads to a high inference time. In contrast, our FReTNA model generates the target sequence in parallel by inserting/deleting multiple tokens at each step. We include only one example here due to space constraints; more examples can be found in the Appendix.
4 Related Work
Earlier works in streaming simultaneous translation, such as Cho and Esipova (2016), Gu et al. (2016), and Press and Smith (2018), lack the ability to anticipate words with missing source context. The Wait-k model introduced by Ma et al. (2019) brought many improvements by introducing a simultaneous translation module that can be easily integrated into most sequence-to-sequence models. Arivazhagan et al. (2019b) introduced MILk, which learns an adaptive schedule using hierarchical attention and hence performs better on the latency-quality trade-off. Wait-k and MILk are both capable of anticipating words and achieving specified latency requirements.
Re-translation is a simultaneous translation task in which revisions to the partial translation beyond strictly appending tokens are permitted. Re-translation was originally investigated by Niehues et al. (2016); Niehues et al. (2018b). More recently, Arivazhagan et al. (2020b) extended the re-translation strategy with prefix-augmented training and proposed a suitable evaluation framework to assess the performance of re-translation models. They establish re-translation to be as good as or better than state-of-the-art streaming systems, even when operating under constraints that allow very few revisions.
Breaking the autoregressive constraints and monotonic (left-to-right) decoding order of classic neural sequence generation systems has been investigated. Stern et al. (2018); Wang et al. (2018) design partially parallel decoding schemes which output multiple tokens at each step. Gu et al. (2017) propose a non-autoregressive framework which uses discrete latent variables, later adopted in Lee et al. (2018) as an iterative refinement process. Ghazvininejad et al. (2019) introduce the masked language modelling objective from BERT Devlin et al. (2018) to non-autoregressively predict and refine translations. Welleck et al. (2019); Stern et al. (2019); Gu et al. (2019a) generate translations non-monotonically by adding words to the left or right of previous ones or by inserting words in arbitrary order to form a sequence. Gu et al. (2019b) propose a non-autoregressive Transformer model based on the Levenshtein distance that supports insertions and deletions. This model achieves better performance and decoding efficiency than previous non-autoregressive models by iteratively performing simultaneous insertion and deletion of multiple tokens.
We leverage the non-autoregressive language generation principles to build efficient re-translation systems having low inference time.
The existing re-translation model achieves better or comparable performance to the streaming simultaneous translation models; however, high inference time remains a challenge. In this work, we propose a new approach for re-translation based simultaneous translation by leveraging non-autoregressive language generation. Specifically, we adopt the Levenshtein Transformer since it is inherently trained to find corrections to the existing (partial) translation. We also propose a new stability metric which is more sensitive to flickers in the output stream. As observed from the experimental results, the proposed approach achieves comparable translation quality with significantly lower computation time than previous autoregressive re-translation approaches.
- Arivazhagan, N., Cherry, C., Te, I., Macherey, W., Baljekar, P., and Foster, G. (2020). Re-translation strategies for long form, simultaneous, spoken language translation. In ICASSP 2020, pp. 7919-7923.
- Arivazhagan, N., Cherry, C., Te, I., Macherey, W., Baljekar, P., and Foster, G. (2019). Re-translation strategies for long form, simultaneous, spoken language translation. arXiv preprint.
- Arivazhagan, N., Cherry, C., Macherey, W., Chiu, C.-C., Yavuz, S., Pang, R., Li, W., and Raffel, C. (2019). Monotonic infinite lookback attention for simultaneous machine translation. In Proceedings of ACL 2019.
- Arivazhagan, N., Cherry, C., Macherey, W., and Foster, G. (2020). Re-translation versus streaming for simultaneous translation. In Proceedings of the 17th International Conference on Spoken Language Translation, Online, pp. 220-227.
- Arivazhagan, N., Cherry, C., Macherey, W., and Foster, G. (2020). Re-translation versus streaming for simultaneous translation. arXiv preprint.
- Cho, K., and Esipova, M. (2016). Can neural machine translation do simultaneous translation? arXiv preprint arXiv:1606.02012.
- Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Ghazvininejad, M., Levy, O., Liu, Y., and Zettlemoyer, L. (2019). Mask-Predict: parallel decoding of conditional masked language models. arXiv preprint arXiv:1904.09324.
- Gu, J., Bradbury, J., Xiong, C., Li, V. O. K., and Socher, R. (2017). Non-autoregressive neural machine translation. arXiv preprint arXiv:1711.02281.
- Gu, J., Liu, Q., and Cho, K. (2019). Insertion-based decoding with automatically inferred generation order. Transactions of the Association for Computational Linguistics 7, pp. 661-676.
- Gu, J., Neubig, G., Cho, K., and Li, V. O. K. (2016). Learning to translate in real-time with neural machine translation. arXiv preprint arXiv:1610.00388.
- Gu, J., Wang, C., and Zhao, J. (2019). Levenshtein Transformer. In Advances in Neural Information Processing Systems, pp. 11181-11191.
- Kim, Y., and Rush, A. M. (2016). Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947.
- Lee, J., Mansimov, E., and Cho, K. (2018). Deterministic non-autoregressive neural sequence modeling by iterative refinement. arXiv preprint arXiv:1802.06901.
- Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady 10 (8), pp. 707-710.
- Ma, M., Huang, L., Xiong, H., Zheng, R., Liu, K., Zheng, B., Zhang, C., He, Z., Liu, H., Li, X., Wu, H., and Wang, H. (2019). STACL: simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3025-3036.
- Niehues, J., Nguyen, T. S., Cho, E., Ha, T.-L., Kilgour, K., Müller, M., Sperber, M., Stüker, S., and Waibel, A. (2016). Dynamic transcription for low-latency speech translation. In Interspeech 2016.
- Niehues, J., Pham, N.-Q., Ha, T.-L., Sperber, M., and Waibel, A. (2018). Low-latency neural speech translation. In Proc. Interspeech 2018, Hyderabad, India, pp. 1293-1297.
- Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., Grangier, D., and Auli, M. (2019). fairseq: a fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.
- Post, M. (2018). A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, Brussels, Belgium, pp. 186-191.
- Press, O., and Smith, N. A. (2018). You may not need attention. arXiv preprint arXiv:1810.13409.
- Sennrich, R., Haddow, B., and Birch, A. (2016). Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1715-1725.
- Stern, M., Chan, W., Kiros, J., and Uszkoreit, J. (2019). Insertion Transformer: flexible sequence generation via insertion operations. arXiv preprint arXiv:1902.03249.
- Stern, M., Shazeer, N., and Uszkoreit, J. (2018). Blockwise parallel decoding for deep autoregressive models. In Advances in Neural Information Processing Systems 31, pp. 10086-10095.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998-6008.
- Wang, C., Zhang, J., and Chen, H. (2018). Semi-autoregressive neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 479-488.
- Welleck, S., Brantley, K., Daumé III, H., and Cho, K. (2019). Non-monotonic sequential text generation. arXiv preprint arXiv:1902.02192.