Simultaneous machine translation (SiMT) Cho2016; gu-etal-2017-learning; ma-etal-2019-stacl; Arivazhagan2019 begins outputting translation before reading the entire source sentence and hence has a lower latency compared to full-sentence machine translation. In practical applications, SiMT usually has to fulfill the requirements with different levels of latency. For example, a live broadcast requires a lower latency to provide smooth translation while a formal conference focuses on translation quality and allows for a slightly higher latency. Therefore, an excellent SiMT model should be able to maintain high translation quality under different latency levels.
However, the existing SiMT methods, which usually employ a fixed or adaptive policy, cannot achieve the best translation performance under different latency with only one model ma-etal-2019-stacl; Ma2019a. With a fixed policy, e.g., the wait-k policy ma-etal-2019-stacl, the SiMT model has to wait for a fixed number of source words to be fed and then read one source word and output one target word alternately. In the wait-k policy, the number of words to wait for can be different during training and testing, denoted as $k_{train}$ and $k_{test}$ respectively, and the latency is determined by $k_{test}$. Figure 1 gives the performance of the model trained with $k_{train}$ under different $k_{test}$, and the results show that under different $k_{test}$ the SiMT model with the best performance corresponds to different $k_{train}$. As a result, multiple models should be maintained for the best performance under different latency. With an adaptive policy, the SiMT model dynamically adjusts the waiting of source tokens for better translation by directly involving the latency in the loss function Arivazhagan2019; Ma2019a. Although the adaptive policy achieves state-of-the-art performance on the open datasets, multiple models need to be trained for different latency, since the model latency is changed by altering the loss function during training. Therefore, to perform SiMT under different latency, both kinds of methods require training multiple models, leading to large costs.
On these grounds, we propose a universal simultaneous machine translation model which can self-adapt to different latency, so that only one model is trained for different latency. To this end, we propose a Mixture-of-Experts Wait-k Policy (MoE wait-k policy) for SiMT where each expert employs the wait-k policy with its own number of waiting source words. For the mixture of experts, we can consider that different experts correspond to different parameter subspaces future-guided, and fortunately the multi-head attention is designed to explore different subspaces with different heads NIPS2017_7181. Therefore, we employ multi-head attention as the implementation of MoE by assigning different heads different numbers of waiting words (wait-1, wait-3, wait-5, $\cdots$). Then, the outputs of different heads (aka experts) are combined with different weights, which are dynamically adjusted to achieve the best translation under different latency.
Experiments on IWSLT15 En→Vi, WMT16 En→Ro and WMT15 De→En show that, even with only one universal SiMT model, our method can outperform strong baselines under all latency, including the state-of-the-art adaptive policy. Further analyses show the promising improvements of our method on efficiency and robustness.
2 Background
Our method is based on the mixture-of-experts approach, multi-head attention and the wait-k policy, so we first briefly introduce each of them.
2.1 Mixture of Experts
Mixture of experts (MoE) 10.1162/neco.1922.214.171.124; DeepExperts; shazeer2017outrageously; peng-etal-2020-mixture is an ensemble learning approach that jointly trains a set of expert modules and mixes their outputs with various weights:
$$\mathrm{MoE} = \sum_{i=1}^{N} G_i \cdot E_i \tag{1}$$
where $N$ is the number of experts, and $E_i$ and $G_i$ are the outputs and weight of the $i$-th expert, respectively.
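As a minimal sketch of the weighted mixture described here, the snippet below combines vector-valued expert outputs with scalar weights. The function name `mix_experts` and the toy numbers are illustrative assumptions, not part of the paper.

```python
def mix_experts(expert_outputs, weights):
    """Return sum_i G_i * E_i for vector-valued expert outputs E_i."""
    if len(expert_outputs) != len(weights):
        raise ValueError("one weight per expert is required")
    dim = len(expert_outputs[0])
    mixed = [0.0] * dim
    for output, weight in zip(expert_outputs, weights):
        for d in range(dim):
            mixed[d] += weight * output[d]
    return mixed


# Toy example (hypothetical values): three experts with 2-dimensional outputs.
experts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gates = [0.5, 0.25, 0.25]
mixed = mix_experts(experts, gates)
```

With normalized weights, the result is a convex combination of the expert outputs.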
2.2 Multi-head Attention
Multi-head attention is the key component of the state-of-the-art Transformer architecture NIPS2017_7181, which allows the model to jointly attend to information from different representation subspaces. Multi-head attention contains $N$ attention heads, where each head independently calculates its outputs between queries, keys and values through scaled dot-product attention. Since our method and the wait-k policy are applied to cross-attention, the following formal expressions are all based on cross-attention, where the queries come from the decoder hidden state $S$, and the keys and values come from the encoder outputs $Z$. Thus, the outputs $H_i^t$ of the $i$-th head when decoding the $t$-th target token are calculated as:
$$H_i^t = f_{att}(S_t, Z; \theta_i) = \mathrm{softmax}\left(\frac{S_t W_i^Q \left(Z W_i^K\right)^{\top}}{\sqrt{d_k}}\right) Z W_i^V \tag{2}$$
where $f_{att}(\cdot; \theta_i)$ represents dot-product attention of the $i$-th head, $W_i^Q$, $W_i^K$ and $W_i^V$ are learned projection matrices, and $d_k$ is the dimension of keys. Then, the outputs of the $N$ heads are concatenated and fed through a learned output matrix $W^O$ to calculate the context vector $C_t$:
$$C_t = \mathrm{MultiHead}(S_t, Z) = \mathrm{Concat}\left(H_1^t, \cdots, H_N^t\right) W^O \tag{3}$$
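The two steps above (per-head scaled dot-product attention, then concatenation and an output projection) can be sketched in plain Python. This is only an illustration under simplifying assumptions: queries, keys and values are taken to be already projected, a single query vector is processed, and all function names and toy values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(query, keys, values):
    """Scaled dot-product attention for one already projected query.

    query: [d_k], keys: [n][d_k], values: [n][d_v] -> output: [d_v]
    """
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    probs = softmax(scores)
    return [sum(p * value[d] for p, value in zip(probs, values))
            for d in range(len(values[0]))]


def multi_head_context(head_outputs, w_o):
    """Concatenate the head outputs and multiply by the output matrix W^O."""
    concat = [x for head in head_outputs for x in head]
    return [sum(concat[r] * w_o[r][c] for r in range(len(concat)))
            for c in range(len(w_o[0]))]


# Toy example: two heads, two source positions, one value dimension per head.
# All-zero keys give uniform attention, so each head averages its values.
head_1 = attention_head([1.0, 0.0], [[0.0, 0.0], [0.0, 0.0]], [[2.0], [4.0]])
head_2 = attention_head([0.0, 1.0], [[0.0, 0.0], [0.0, 0.0]], [[1.0], [3.0]])
context = multi_head_context([head_1, head_2], [[1.0, 0.0], [0.0, 1.0]])
```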
2.3 Wait-k Policy
Wait-k policy ma-etal-2019-stacl refers to first waiting for $k$ source tokens and then reading and writing one token alternately. Since $k$ is input from the outside of the model, we call $k$ the external lagging $k_s$. We define $g(t)$ as a monotonic non-decreasing function of $t$, which represents the number of source tokens read in when generating the $t$-th target token. In particular, for the wait-k policy, given external lagging $k_s$, $g(t; k_s)$ is calculated as:
$$g(t; k_s) = \min\{k_s + t - 1, |Z|\}, \quad t = 1, 2, \cdots \tag{4}$$
In the wait-k policy, the source tokens processed by the encoder are limited to the first $g(t; k_s)$ tokens when generating the $t$-th target token. Thus, the outputs $H_i^t$ of each head in the cross-attention are calculated as:
$$H_i^t = f_{att}\left(S_t, Z_{\leq g(t; k_s)}; \theta_i\right) \tag{5}$$
where $Z_{\leq g(t; k_s)}$ represents the encoder outputs when the first $g(t; k_s)$ source tokens are read in.
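The wait-k read schedule is easy to make concrete. The sketch below assumes 1-indexed target steps and a known source length; the helper names `wait_k_read` and `visible_prefix` are illustrative, not taken from any released implementation.

```python
def wait_k_read(t, k, source_len):
    """Number of source tokens read before writing target token t (1-indexed)."""
    if t < 1 or k < 1:
        raise ValueError("t and k must be positive")
    return min(k + t - 1, source_len)


def visible_prefix(encoder_outputs, t, k):
    """Encoder outputs that cross-attention may look at when writing token t."""
    return encoder_outputs[:wait_k_read(t, k, len(encoder_outputs))]


# wait-3 on a 6-token source: read 3 tokens first, then one more per target token.
schedule = [wait_k_read(t, 3, 6) for t in range(1, 7)]
prefix = visible_prefix(["z1", "z2", "z3", "z4", "z5", "z6"], 2, 3)
```

Once the whole source has been read, the schedule saturates at the source length and the model decodes the remaining target tokens with full context.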
The standard wait-k policy ma-etal-2019-stacl trains a set of SiMT models, where each model is trained with a fixed wait-$k_{train}$ and tested with the corresponding wait-$k_{test}$ ($k_{test} = k_{train}$). multipath proposed multipath training, which uniformly samples $k_{train}$ in each batch during training. However, training a single set of parameters with both small and large $k_{train}$ tends to confuse the model parameters between different subspace distributions.
3 The Proposed Method
In this section, we first view multi-head attention from the perspective of the mixture of experts, and then introduce our method based on it.
3.1 Multi-head Attention from MoE View
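Multi-head attention can be interpreted from the perspective of the mixture of experts peng-etal-2020-mixture, where each head acts as an expert. Thus, Eq.(3) can be rewritten as:
$$C_t = \mathrm{MultiHead}(S_t, Z) = \sum_{i=1}^{N} G_i^t \cdot E_i^t \tag{6}$$
$$E_i^t = N \cdot H_i^t W_i^O, \qquad G_i^t = \frac{1}{N} \tag{7}$$
where $\left[W_1^O; \cdots; W_N^O\right] = W^O$ is a row-wise block sub-matrix representation of $W^O$. $E_i^t$ is the outputs of the $i$-th expert at step $t$, and $G_i^t$ is the weight of $E_i^t$. Therefore, multi-head attention can be regarded as a mixture of experts, where experts have the same function but different parameters ($\theta_i$) and the normalized weights are equal ($G_i^t = \frac{1}{N}$).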
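The equivalence between the concatenation form and the equal-weight mixture form can be checked numerically. The sketch below assumes that each head output is a row vector and that the output matrix is split into row blocks, one per head; the function names and toy matrices are hypothetical.

```python
def matvec(vec, matrix):
    """Row vector times matrix, with the matrix given as a list of rows."""
    return [sum(vec[r] * matrix[r][c] for r in range(len(vec)))
            for c in range(len(matrix[0]))]


def concat_form(heads, w_o):
    """Context vector as Concat(H_1, ..., H_N) W^O."""
    concat = [x for head in heads for x in head]
    return matvec(concat, w_o)


def moe_form(heads, w_o):
    """Context vector as sum_i G_i * E_i with E_i = N * H_i W_i^O and G_i = 1/N."""
    n = len(heads)
    d_head = len(heads[0])
    context = [0.0] * len(w_o[0])
    for i, head in enumerate(heads):
        w_o_i = w_o[i * d_head:(i + 1) * d_head]  # row block of W^O for head i
        expert = [n * x for x in matvec(head, w_o_i)]
        gate = 1.0 / n
        for d in range(len(context)):
            context[d] += gate * expert[d]
    return context


heads = [[1.0, 2.0], [3.0, 4.0]]
w_o = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [2.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
by_concat = concat_form(heads, w_o)
by_moe = moe_form(heads, w_o)
```

Both forms give the same context vector, which is what allows the heads to be reinterpreted as experts without changing the model.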
3.2 Mixture-of-Experts Wait-k Policy
To get a universal model which can perform SiMT with a high translation quality under arbitrary latency, we introduce the Mixture-of-Experts Wait-k Policy (MoE wait-k) into SiMT to redefine the experts and weights in multi-head attention (Eq.(7)). As shown in Figure 2, experts are given different functions, i.e., performing the wait-k policy with different latency, and their outputs are denoted as $E_i^t$. Meanwhile, under the premise of normalization, the weights of experts are no longer equal but dynamically adjusted according to the source input and latency requirement, denoted as $G_i^t$. The details are introduced below.
3.2.1 Experts with Different Functions
The experts in our method are assigned different functions, where each expert performs SiMT with a different latency. In addition to the external lagging $k_s$ in the standard wait-k policy, we define the expert lagging $K_{MoE} = \left[k_{E_1}, \cdots, k_{E_N}\right]$, where $k_{E_i}$ is the hyperparameter we set to represent the fixed lagging of the $i$-th expert. For example, for a Transformer with 8 heads, if we set $K_{MoE} = [1, 3, 5, 7, 9, 11, 13, 15]$, then each expert corresponds to one head and the 8 experts concurrently perform wait-1, wait-3, wait-5, $\cdots$, wait-15 respectively. Specifically, given $K_{MoE}$, the outputs $H_i^t$ of the $i$-th head at step $t$ are calculated as:
$$H_i^t = f_{att}\left(S_t, Z_{\leq \min\left(g(t; k_{E_i}),\, g(t; k_s)\right)}; \theta_i\right) \tag{8}$$
where $g(t; k_{E_i})$ is the number of source tokens processed by the $i$-th expert at step $t$ and $g(t; k_s)$ is the number of all available source tokens read in at step $t$. During training, $k_s$ is uniformly sampled in each batch with multipath training multipath. During testing, $k_s$ is the input test lagging.
Then, the outputs $E_i^t$ of the $i$-th expert when generating the $t$-th target token are calculated as:
$$E_i^t = N \cdot H_i^t W_i^O \tag{9}$$
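To illustrate how the expert lagging and the external lagging interact, the sketch below computes how many source tokens each expert may attend to at a given step: the smaller of what its own wait-k schedule allows and what has actually been read. The function names are illustrative assumptions; the lagging list is the 8-head example from the text.

```python
def wait_k_read(t, k, source_len):
    """Tokens read under a wait-k schedule at target step t (1-indexed)."""
    return min(k + t - 1, source_len)


def expert_read(t, expert_k, external_k, source_len):
    """Source tokens visible to one expert: limited by its own lagging
    and by the number of tokens actually available under the external lagging."""
    return min(wait_k_read(t, expert_k, source_len),
               wait_k_read(t, external_k, source_len))


K_MOE = [1, 3, 5, 7, 9, 11, 13, 15]  # expert lagging for an 8-head Transformer

# At the first target step with external lagging 5 on a 20-token source,
# experts with lagging above 5 are capped at the 5 tokens read so far.
visible_low = [expert_read(1, k_e, 5, 20) for k_e in K_MOE]
# With external lagging 15, every expert follows its own schedule.
visible_high = [expert_read(1, k_e, 15, 20) for k_e in K_MOE]
```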
3.2.2 Dynamic Weights for Experts
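Each expert has a clear division of labor through the expert lagging $K_{MoE}$. Then, for different inputs and latency, we dynamically weight each expert with the predicted $G_i^t$, where $G_i^t$ can be considered as the confidence of the expert outputs $E_i^t$. The factor to predict $G_i^t$ consists of two components:
$e_i^t$: The average cross-attention scores in the $i$-th expert at step $t$, which are averaged over all source tokens read in Zheng2019b.
$k_s$: External lagging in Eq.(8).
At step $t$, all $e_i^t$ and $k_s$ are concatenated and fed through the multi-layer perceptron (MLP) to predict the confidence score $\beta_i^t$ of the $i$-th expert, which is then normalized to calculate the weight $G_i^t$:
$$\beta_i^t = \tanh\left(\left[e_1^t; \cdots; e_N^t; k_s\right] W_i + b_i\right) \tag{10}$$
$$G_i^t = \frac{\exp\left(\beta_i^t\right)}{\sum_{l=1}^{N} \exp\left(\beta_l^t\right)} \tag{11}$$
where $W_i$ and $b_i$ are parameters of the MLP to predict $\beta_i^t$. Given expert outputs $E_i^t$ and weights $G_i^t$, the context vector $C_t$ is calculated as:
$$C_t = \sum_{i=1}^{N} G_i^t \cdot E_i^t \tag{12}$$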
The algorithm details of the proposed MoE wait-k policy are shown in Algorithm 1. At decoding step $t$, each expert performs the wait-k policy with different latency according to the expert lagging $K_{MoE}$, and then the expert outputs are dynamically weighted to calculate the context vector $C_t$.
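The weighting step can be sketched as follows. This is a rough illustration rather than the released implementation: it assumes the MLP is a single linear layer per expert followed by tanh, that the confidence scores are normalized with a softmax, and that the features are the per-expert average attention scores plus the external lagging. All names and parameter values are hypothetical.

```python
import math


def expert_weights(avg_attn, external_k, mlp_weights, mlp_biases):
    """Predict normalized expert weights from the two factors in the text.

    avg_attn:    N average cross-attention scores, one per expert
    external_k:  external lagging
    mlp_weights: N weight vectors of length N + 1 (assumed single-layer MLP)
    mlp_biases:  N scalar biases
    """
    features = list(avg_attn) + [float(external_k)]
    betas = [math.tanh(sum(f * w for f, w in zip(features, w_i)) + b_i)
             for w_i, b_i in zip(mlp_weights, mlp_biases)]
    peak = max(betas)
    exps = [math.exp(b - peak) for b in betas]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_context(expert_outputs, weights):
    """Context vector as the weighted sum of the expert outputs."""
    dim = len(expert_outputs[0])
    return [sum(w * out[d] for w, out in zip(weights, expert_outputs))
            for d in range(dim)]


# With all-zero MLP parameters every confidence score is 0, so the weights
# fall back to the equal-weight case 1/N.
uniform = expert_weights([0.2, 0.5, 0.3], 5, [[0.0] * 4] * 3, [0.0] * 3)
# A positive bias on the last expert raises its weight above the others.
skewed = expert_weights([0.2, 0.5, 0.3], 5, [[0.0] * 4] * 3, [0.0, 0.0, 1.0])
context = weighted_context([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], uniform)
```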
3.2.3 Training Method
We apply two-stage training, where both stages apply multipath training multipath, i.e., randomly sampling $k_{train}$ ($k_s$ in Eq.(8)) in every batch during training. First stage: fix the weights $G_i^t$ equal to $\frac{1}{N}$ and pre-train the expert parameters. Second stage: jointly fine-tune the parameters of the experts and their weights. At inference time, the universal model is tested with arbitrary latency (test lagging). In Sec.5, we compare the proposed two-stage training method with the one-stage training method, which directly trains the parameters of the experts and their weights together.
We tried the block coordinate descent (BCD) training peng-etal-2020-mixture, which was proposed to train experts that share the same function, but it is not suitable for our method, as the experts in MoE wait-k have already been assigned different functions. Instead, our method can be stably trained through back-propagation directly.
4 Related Work
Mixture of experts. MoE was first proposed in multi-task learning 10.1162/neco.19126.96.36.199; 10.1145/1015330.1015432; Liu_2018_ECCV; 10.1145/3219819.3220007; Dutt2018CoupledEO. Recently, shazeer2017outrageously applied MoE in sequence learning. Some work he-etal-2018-sequence; pmlr-v97-shen19c; cho-etal-2019-mixture applied MoE to diverse generation. peng-etal-2020-mixture applied MoE in MT and combined heads in the Transformer into experts.
Previous works mostly applied MoE for diversity. Our method makes the experts more regular in parameter space, which provides a way to improve translation quality with MoE.
SiMT. Early read/write policies in SiMT used segmented translation bangalore-etal-2012-real; Cho2016; siahbani-etal-2018-simultaneous. grissom-ii-etal-2014-dont predicted the final verb in SiMT. gu-etal-2017-learning trained a read/write agent with reinforcement learning. Alinejad2019 added a predict operation based on gu-etal-2017-learning.
Recent read/write policies fall into two categories: fixed and adaptive. For the fixed policy, dalvi-etal-2018-incremental proposed STATIC-RW, and ma-etal-2019-stacl proposed the wait-k policy, which always generates target tokens lagging $k$ tokens behind the source. multipath enhanced the wait-k policy by sampling different $k$ during training. han-etal-2020-end applied meta-learning in wait-k. future-guided proposed future-guided training for the wait-k policy. zhang-feng-2021-icts proposed a char-level wait-k policy. For the adaptive policy, Zheng2019b trained an agent with gold read/write sequences. Zheng2019a added a "delay" token to read. Arivazhagan2019 proposed MILk, which used a Bernoulli variable to determine writing. Ma2019a proposed MMA, which is the implementation of MILk on the Transformer. zheng-etal-2020-simultaneous ensembled multiple wait-k models to develop an adaptive policy. zhang-zhang-2020-dynamic and zhang-etal-2020-learning-adaptive proposed adaptive segmentation policies. bahar-etal-2020-start and wilken-etal-2020-neural proposed alignment-based chunking policies.
A common weakness of the previous methods is that they all train separate models for different latency. Our method needs only one universal model to perform SiMT under all latency, and meanwhile achieves better translation quality.
5 Experiments
5.1 Datasets
We evaluated our method on the following three datasets, which range in scale from small to large.
IWSLT15 (nlp.stanford.edu/projects/nmt/) English→Vietnamese (En-Vi) (133K pairs) iwslt2015: We use TED tst2012 (1553 pairs) as the validation set and TED tst2013 (1268 pairs) as the test set. Following LinearTime and Ma2019a, we replace tokens whose frequency is less than 5 with ⟨unk⟩. After replacement, the vocabulary sizes are 17K and 7.7K for English and Vietnamese, respectively.
WMT16 (www.statmt.org/wmt16/) English→Romanian (En-Ro) (0.6M pairs) lee-etal-2018-deterministic: We use news-dev2016 (1999 pairs) as the validation set and news-test2016 (1999 pairs) as the test set.
WMT15 (www.statmt.org/wmt15/) German→English (De-En) (4.5M pairs): Following the setting from ma-etal-2019-stacl and Ma2019a, we use newstest2013 (3000 pairs) as the validation set and newstest2015 (2169 pairs) as the test set.
For En-Ro and De-En, BPE sennrich-etal-2016-neural is applied with 32K merge operations and the vocabulary is shared across languages.
5.2 System Settings
We conducted experiments on the following systems.
Offline: Conventional Transformer NIPS2017_7181 model for full-sentence translation, decoding with greedy search.
Standard Wait-k: Standard wait-k policy proposed by ma-etal-2019-stacl. When evaluating with the test lagging $k_{test}$, we apply the result from the model trained with $k_{train}$, where $k_{train} = k_{test}$.
Optimal Wait-k: An optimal variation of standard wait-k. When decoding with $k_{test}$, we traverse all models trained with different $k_{train}$ and apply the optimal result among them. For example, if the best result when testing with wait-1 ($k_{test} = 1$) comes from the model trained with wait-5 ($k_{train} = 5$), we apply this optimal result. 'Optimal Wait-k' selects the best result according to the reference, so it can be considered as an oracle.
Multipath Wait-k: An efficient training method for the wait-k policy multipath. In training, $k_{train}$ is no longer fixed, but randomly sampled from all possible lagging in each batch.
MU: A segmentation policy based on meaning units proposed by zhang-etal-2020-learning-adaptive, which obtains comparable results with the SOTA adaptive policy. At each decoding step, if a meaning unit is detected through a BERT-based classifier, 'MU' feeds the received source tokens into a full-sentence MT model to generate target tokens and stops when the <EOS> token is generated.
MMA (github.com/pytorch/fairseq/tree/master/examples/simultaneous_translation): Monotonic multi-head attention (MMA) proposed by Ma2019a, the state-of-the-art adaptive policy for SiMT, which is the implementation of 'MILk' Arivazhagan2019 based on the Transformer. At each decoding step, 'MMA' predicts a Bernoulli variable to decide whether to start translating or wait for the next source token.
MoE Wait-k: A variation of our method, which directly trains the parameters of the experts and their weights together in one-stage training.
Equal-Weight MoE Wait-k: A variation of our method. The weight of each expert is fixed to $\frac{1}{N}$.
MoE Wait-k + FT: Our method in Sec.3.2.
We compare our method with ‘MMA’ and ‘MU’ on De-En(Big) since they report their results on De-En with Transformer-Big.
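Table 1: Settings of expert lagging $K_{MoE}$.

| Architecture | Expert lagging $K_{MoE}$ |
|---|---|
| Transformer-Small (4 heads) | [1, 6, 11, 16] |
| Transformer-Base (8 heads) | [1, 3, 5, 7, 9, 11, 13, 15] |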
The implementations of all systems are adapted from the Fairseq library ott-etal-2019-fairseq, and the setting is exactly the same as ma-etal-2019-stacl and Ma2019a. To verify that our method is effective on Transformers with different head settings, we conduct experiments on three types of Transformer, where the settings are the same as NIPS2017_7181. For En-Vi, we apply Transformer-Small (4 heads). For En-Ro, we apply Transformer-Base (8 heads). For De-En, we apply both Transformer-Base and Transformer-Big (16 heads). Table 2 reports the parameters of different SiMT systems on De-En(Big). To perform SiMT under different latency, 'Standard Wait-k', 'Optimal Wait-k' and 'MMA' all require multiple models, while 'Multipath Wait-k', 'MU' and 'MoE Wait-k' only need one trained model.
Expert lagging $K_{MoE}$ in MoE wait-k is the hyperparameter we set, which represents the lagging of each expert. We did not search extensively over $K_{MoE}$, but set it to be uniformly distributed in a reasonable lagging interval, as shown in Table 1. We will analyze the influence of different settings of $K_{MoE}$ in our method in Sec.6.5.
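We evaluate these systems with BLEU post-2018-call for translation quality and Average Lagging (AL; github.com/SimulTrans-demo/STACL) ma-etal-2019-stacl for latency. Given $g(t)$, the latency metric AL is calculated as:
$$AL = \frac{1}{\tau} \sum_{t=1}^{\tau} \left( g(t) - \frac{t-1}{|\mathbf{y}| / |\mathbf{x}|} \right), \qquad \tau = \min\left\{t \mid g(t) = |\mathbf{x}|\right\} \tag{13}$$
where $|\mathbf{x}|$ and $|\mathbf{y}|$ are the lengths of the source sentence and target sentence respectively.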
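As a worked example of the latency metric, the sketch below computes Average Lagging for a read schedule, following the definition of ma-etal-2019-stacl. The cut-off step is taken as the first target step at which the whole source has been read; falling back to the last target step when the source is never fully read is an assumption of this sketch. Function names are illustrative.

```python
def average_lagging(g, source_len, target_len):
    """Average Lagging for a read schedule.

    g[t - 1] is the number of source tokens read before writing target token t.
    """
    rate = target_len / source_len
    # First target step at which the full source has been read.
    # Assumption of this sketch: fall back to the last step if that never happens.
    tau = next((t for t, read in enumerate(g, start=1) if read >= source_len),
               len(g))
    return sum(g[t - 1] - (t - 1) / rate for t in range(1, tau + 1)) / tau


def wait_k_schedule(k, source_len, target_len):
    """Read schedule of the wait-k policy for every target step."""
    return [min(k + t - 1, source_len) for t in range(1, target_len + 1)]


# For equal source and target lengths, wait-k lags exactly k tokens on average.
al_wait3 = average_lagging(wait_k_schedule(3, 10, 10), 10, 10)
# Reading the whole source first gives a lag equal to the source length.
al_offline = average_lagging([10] * 10, 10, 10)
```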
5.3 Main Results
Figure 3 and Figure 4 show the comparison between our method and the previous methods on Transformers with various head settings. In all settings, 'MoE wait-k + FT' outperforms the previous methods under all latency. Our method brings the performance of SiMT much closer to that of the offline model, almost reaching the performance of full-sentence MT when lagging 9 tokens.
Compared with 'Standard Wait-k', our method improves by 0.60 BLEU on En-Vi, 2.11 BLEU on En-Ro, 2.33 BLEU on De-En(Base), and 2.56 BLEU on De-En(Big), respectively (averaged over all latency). More importantly, our method only needs one well-trained universal model to perform SiMT under all latency, while 'Standard Wait-k' requires training a different model for each latency. Besides, 'Optimal Wait-k' traverses many models to obtain the optimal result under each latency. Our method dynamically weights experts according to the test latency, and outperforms 'Optimal Wait-k' under all latency, without searching among many models.
Both our method and 'Multipath Wait-k' can train a universal model, but our method avoids the mutual interference between different sampled $k$ during training. 'Multipath Wait-k' often improves the translation quality under low latency, but its translation quality under high latency is poor elbayad-etal-2020-trac. The reason is that sampling a slightly larger $k_{train}$ in training improves the translation quality under low latency ma-etal-2019-stacl; future-guided, but sampling a smaller $k_{train}$ harms the translation quality under high latency. Our method introduces expert lagging and dynamic weights, avoiding the interference caused by multipath training.
Compared with 'MMA' and 'MU', our method performs better. 'MU' sets a threshold to perform SiMT under different latency and achieves good translation quality, but it is difficult for it to perform SiMT under low latency as it is a segmentation policy. As a fixed policy, our method maintains the advantage of simple training and meanwhile catches up with the adaptive policy 'MMA' on translation quality, which is encouraging. Furthermore, our method only needs one universal model to perform SiMT under different latency, and the test latency can be set manually, which is impossible for the previous adaptive policy.
5.4 Ablation Study
We conducted ablation studies on the dynamic weights and two-stage training, as shown in Figure 3 and Figure 4. The translation quality decreases significantly when all experts are given equal weight. Our method dynamically adjusts the weight of each expert according to the input and test lagging, so it performs well under all latency with a single model. For the training methods, the two-stage training method makes the training of weights more stable, thereby improving the translation quality, especially under high latency.
6 Analysis
We conducted extensive analyses to understand the specific improvements of our method. Unless otherwise specified, all the results are reported on De-En with Transformer-Base (8 heads).
6.1 Performance on Various Difficulty Levels
The difference between the target and source word order is one of the challenges of SiMT, where many word order inversions force the model to start translating before reading the aligned source words. To verify the performance of our method on SiMT with various difficulty levels, we evenly divided the test set into three parts: EASY, MIDDLE and HARD. Specifically, we used fast-align (https://github.com/clab/fast_align) dyer-etal-2013-simple to align the source with the target, and then calculated the number of crosses in the alignments (number of reversed word orders), which is used as a basis to divide the test set 2020arXiv201011247C; future-guided. After the division, the alignments in the EASY set are basically monotonic, and the sentence pairs in the HARD set contain at least 12 reversed word orders.
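The difficulty split can be sketched as follows, assuming alignments are given in the `i-j` (Pharaoh) text format that fast_align prints, one sentence pair per line. Counting every pair of alignment links that cross is one plausible reading of "number of crosses"; the helper names and the example data are hypothetical.

```python
def parse_alignment(line):
    """Parse one alignment line such as '0-0 1-2 2-1' into (src, tgt) pairs."""
    return [tuple(int(x) for x in link.split("-")) for link in line.split()]


def count_crossings(alignment):
    """Count pairs of links (s1, t1), (s2, t2) with s1 < s2 but t1 > t2."""
    links = sorted(alignment)
    crossings = 0
    for a in range(len(links)):
        for b in range(a + 1, len(links)):
            (s1, t1), (s2, t2) = links[a], links[b]
            if s1 < s2 and t1 > t2:
                crossings += 1
    return crossings


def split_by_difficulty(crossing_counts):
    """Evenly split sentence indices into EASY, MIDDLE and HARD by crossings."""
    order = sorted(range(len(crossing_counts)), key=lambda i: crossing_counts[i])
    third = len(order) // 3
    return {"EASY": order[:third],
            "MIDDLE": order[third:2 * third],
            "HARD": order[2 * third:]}


monotonic = count_crossings(parse_alignment("0-0 1-1 2-2"))
one_swap = count_crossings(parse_alignment("0-0 1-2 2-1"))
parts = split_by_difficulty([0, 5, 12, 1, 3, 20])
```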
Our method outperforms the standard wait-k on all difficulty levels, with an especially large improvement of 3.90 BLEU on the HARD set under low latency. The HARD set contains many word order reversals, which is disastrous for low-latency SiMT such as testing with wait-1. The standard wait-k enables the model to gain some implicit prediction ability ma-etal-2019-stacl, and our method further strengthens it. MoE wait-k introduces multiple experts with varying expert lagging, of which the larger expert lagging helps the model to improve the implicit prediction ability future-guided, while the smaller expert lagging avoids learning too much future information during training and prevents the hallucination caused by over-prediction 2020arXiv201011247C. With MoE wait-k, the implicit prediction ability is stronger and more stable.
6.2 Improvement on Robustness
Robustness is another major challenge for SiMT zheng-etal-2020-fluent. SiMT is often used as a downstream task of streaming automatic speech recognition (ASR), but the results of streaming ASR are not stable, especially the last recognized source token li-etal-2020-bits; gaido-etal-2020-end; zheng-etal-2020-fluent. In each decoding step, we randomly modified the last source token with different proportions, and the results are shown in Figure 5.
Our method is more robust to a noisy last token, owing to its multiple experts. Due to the different expert lagging, the number of source tokens processed by each expert differs, and some experts do not consider the last token. Thus, the noisy last token only affects some experts, while the other experts are not disturbed, which gives rise to robustness.
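The robustness argument can be illustrated with a small sketch: one helper corrupts the last token of the source prefix with a given probability, and another lists which experts actually attend to that last token. The corruption procedure (replacing the token with a random different vocabulary entry) is an assumption for illustration, as are all names and the toy vocabulary.

```python
import random


def corrupt_last_token(prefix, vocab, proportion, rng):
    """With probability `proportion`, replace the last token by another word."""
    if prefix and rng.random() < proportion:
        candidates = [word for word in vocab if word != prefix[-1]]
        return prefix[:-1] + [rng.choice(candidates)]
    return list(prefix)


def experts_seeing_last_token(t, expert_laggings, external_k, source_len):
    """Expert laggings whose visible prefix includes the most recently read token."""
    read = min(external_k + t - 1, source_len)
    return [k_e for k_e in expert_laggings
            if min(k_e + t - 1, source_len) >= read]


rng = random.Random(0)
clean = ["wir", "sehen", "das", "haus"]
noisy = corrupt_last_token(clean, ["haus", "maus", "hans"], 1.0, rng)
untouched = corrupt_last_token(clean, ["haus", "maus", "hans"], 0.0, rng)

# With test lagging 5, the experts with lagging 1 and 3 never see the last
# read token, so noise on that token cannot reach them.
affected = experts_seeing_last_token(1, [1, 3, 5, 7, 9, 11, 13, 15], 5, 20)
```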
6.3 Differentiation of Experts Distribution
Our method clearly divides the experts into different functions and integrates the expert outputs from different subspaces for better translation. For 'Multipath Wait-k' and our method, we sampled 200 cases, reduced the dimension of the expert outputs (evaluating with wait-5) with the t-Distributed Stochastic Neighbor Embedding (tSNE) technique, and show the subspace distribution of the expert outputs in Figure 6.
The expert outputs in 'Multipath Wait-k' differ slightly, but most of them are fused together, which shows some similarities between heads. In our method, due to the clear division of labor, the expert outputs are significantly different and regular in the subspace distribution, which has been shown to be beneficial to translation li-etal-2018-multi. Besides, our method has better space utilization and integrates information from multiple designated subspaces.
6.4 Superiority of Dynamic Weights
Different expert outputs are dynamically weighted to achieve the best performance under the current test latency, so we calculated the average weight of each expert under different latency in Table 4.
Through dynamic weighting, the expert lagging of the expert with the highest weight is similar to the $k_{train}$ of the optimal model with standard wait-k, while avoiding the traversal of many trained models. When the test lagging is larger, the expert with larger expert lagging has a higher weight, and vice versa. Besides, the expert with a slightly larger expert lagging than $k_{test}$ tends to get the highest weight for better translation, which is in line with the previous conclusions ma-etal-2019-stacl; future-guided. Furthermore, our method enables the model to consider the various expert outputs jointly through dynamic weights, thereby producing a more comprehensive translation.
6.5 Effect of Expert Lagging
Expert lagging $K_{MoE}$ is the hyperparameter we set to control the lagging of each expert. We experimented with several settings of $K_{MoE}$ to study the effects of different expert lagging, as shown in Figure 7.
Overall, all types of $K_{MoE}$ outperform the baseline, and different $K_{MoE}$ only have a slight impact on the performance, which shows that our method is not sensitive to how $K_{MoE}$ is set. Furthermore, there are some subtle differences between different $K_{MoE}$, where the 'Original' setting performs best. 'Low interval' and 'High interval' only perform well under a part of the latency, as their $K_{MoE}$ is concentrated in a small lagging interval. 'Repeated' does not perform well, as the diversity of expert lagging is poor, which loses the advantages of MoE. The performance of 'Wide span' drops under low latency, because the average length of the sentences is about 20 tokens, so a much larger lagging is not conducive to low-latency SiMT.
In summary, we give a general method for setting expert lagging $K_{MoE}$: $K_{MoE}$ should maintain diversity and be uniformly distributed in a reasonable lagging interval, such as lagging 1 to 15 tokens.
7 Conclusion and Future Work
In this paper, we propose the Mixture-of-Experts Wait-k Policy to develop a universal SiMT model, which can perform high-quality SiMT under arbitrary latency to suit different scenarios. Experiments and analyses show that our method achieves promising results on performance, efficiency and robustness.
In the future, since MoE wait-k yields a universal SiMT model with high quality, it can be applied as a SiMT kernel that cooperates with a refined external policy to further improve performance.
We thank all the anonymous reviewers for their insightful and valuable comments. This work was supported by National Key R&D Program of China (NO. 2017YFE0192900).