
MvSR-NAT: Multi-view Subset Regularization for Non-Autoregressive Machine Translation

Conditional masked language models (CMLM) have shown impressive progress in non-autoregressive machine translation (NAT). They learn the conditional translation model by predicting a randomly masked subset of the target sentence. Based on the CMLM framework, we introduce Multi-view Subset Regularization (MvSR), a novel regularization method to improve the performance of the NAT model. Specifically, MvSR consists of two parts: (1) shared mask consistency: we forward the same target with different mask strategies, and encourage the predictions at shared mask positions to be consistent with each other; (2) model consistency: we maintain an exponential moving average of the model weights, and enforce the predictions to be consistent between the average model and the online model. Without changing the CMLM-based architecture, our approach achieves remarkable performance on three public benchmarks, with gains of 0.36-1.14 BLEU over previous NAT models. Moreover, compared with the stronger Transformer baseline, we reduce the gap to 0.01-0.44 BLEU on small datasets (WMT16 RO↔EN and IWSLT DE→EN).




The Transformer has been the de facto architecture for neural machine translation Vaswani et al. (2017). In this framework, the decoder generates words one by one in a left-to-right manner. Despite its strong performance, autoregressive decoding causes high latency at inference time Gu et al. (2018). To break the inference speed bottleneck caused by the sequential conditional dependence, several non-autoregressive neural machine translation (NAT) models have been proposed to generate all tokens in parallel (Figure 1(a)) Gu et al. (2018); Łukasz Kaiser et al. (2018); Li et al. (2019); Ma et al. (2019). However, vanilla NAT models pay for this speed with lower translation accuracy, because they remove the conditional dependence between target tokens.

To close the gap to autoregressive models, iterative NAT models have been proposed to refine the translation results. They reintroduce conditional dependency between target tokens over several decoding iterations Ghazvininejad et al. (2019); Ghazvininejad, Levy, and Zettlemoyer (2020); Xie et al. (2020a); Kasai et al. (2020); Guo, Xu, and Chen (2020). Among them, Ghazvininejad et al. (2019) are the first to apply the conditional masked language model (CMLM) to NAT (Figure 1(b)). Following this framework, several CMLM-based NAT models have been proposed and obtain state-of-the-art performance among NAT models Xie et al. (2020a); Guo, Xu, and Chen (2020).

An open question is whether the potential of the CMLM-based NAT model has been fully exploited, since the masked language model has achieved significant breakthroughs in natural language processing.

Figure 1: (a) Vanilla NAT model. (b) CMLM-based NAT model. (c) CMLM architecture with shared mask consistency, where the blue [MASK] marks a position that is masked in both masked target sentences. (d) CMLM architecture with model consistency, where EMA means the exponential moving average method.

To answer this question, we introduce Multi-view Subset Regularization (MvSR), a novel regularization method to improve the performance of the CMLM-based NAT model. Specifically, our approach includes two regularization methods: shared mask consistency and model consistency. For shared mask consistency, as shown in Figure 1(c), we randomly mask two different subsets of the same target sentence. We then encourage the predicted distributions at the shared masked positions to be consistent with each other. As an example, consider the original sentence and the two masked sentences in Table 1. The original token "window" is replaced with [MASK] in both masked sentences. Although the contexts of "window" differ because of the random mask strategies, its semantics and generated distributions are expected to be consistent across the two views. In summary, we introduce a new paradigm of regularization: the same target sentence is masked with different strategies, and the tokens at the shared masked positions are semantic-preserving across the different views. This approach is reminiscent of multi-view contrastive learning Tian, Krishnan, and Isola (2020), but our method is not "contrastive": it only considers the consistency of "positive pairs".

original: the cat went through an open window in the house .
masked 1: the cat [MASK] [MASK] an open [MASK] in the house .
masked 2: the cat went through an [MASK] [MASK] in the [MASK] .
Table 1: An example target sentence randomly masked twice. The [MASK] at the position of "window" (shown in blue in the original table) is masked in both masked sentences.
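The shared masked positions in Table 1 can be found mechanically. The following is a minimal sketch, not the authors' implementation: the helper names `random_mask` and `shared_positions` are hypothetical, and tokens are split on whitespace for illustration only.

```python
import random

MASK = "[MASK]"

def random_mask(tokens, num_masked, rng):
    """Replace `num_masked` randomly chosen positions with [MASK]."""
    positions = set(rng.sample(range(len(tokens)), num_masked))
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    return masked, positions

def shared_positions(view_a, view_b):
    """Return the positions that are [MASK] in both views."""
    return [i for i, (a, b) in enumerate(zip(view_a, view_b))
            if a == MASK and b == MASK]

# The original sentence and the two masked views from Table 1.
original = "the cat went through an open window in the house .".split()
view_1 = "the cat [MASK] [MASK] an open [MASK] in the house .".split()
view_2 = "the cat went through an [MASK] [MASK] in the [MASK] .".split()

shared = shared_positions(view_1, view_2)
shared_tokens = [original[i] for i in shared]
print(shared, shared_tokens)
```

Only the positions returned by `shared_positions` would take part in the shared mask consistency term; all other positions are ignored by it.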

Model consistency (Figure 1(d)) is inspired by the observation that checkpoint averaging is an essential technique for improving machine translation performance Vaswani et al. (2017). Similarly, Mean Teacher Tarvainen and Valpola (2017) shows that using an average model as a teacher improves results. Accordingly, we construct an average model by updating its weights with an exponential moving average (EMA). We then penalize generated distributions that are inconsistent between this average model and the online model. Note that we adopt the bidirectional Kullback-Leibler (KL) divergence instead of the mean squared error (MSE) as the consistency cost. This is related to mutual learning Zhang et al. (2018), but requires no extra parameters.

As in prior work, we evaluate our MvSR-NAT model on several public benchmark datasets. It outperforms previous NAT models and achieves results comparable to the autoregressive Transformer. Intuitively, our two regularization methods have two advantages: 1) they act as stabilizers that make the model more robust to randomness; 2) they reduce the discrepancy between the training and inference phases.

Specifically, shared mask consistency first makes the model more robust to random masking. Second, we adopt the mask-predict decoding method Ghazvininejad et al. (2019), in which predicted target tokens are replaced by [MASK] symbols during inference. In particular, in the first decoding iteration all target tokens are [MASK] symbols. This decoding strategy differs from training, where the masks are random. Because the model becomes more robust to random masks, our method reduces the discrepancy between training and inference caused by [MASK] symbols, thus improving translation quality.

Model consistency, in turn, first penalizes the sensitivity of the predictions to the model weights, thus improving robustness. Second, the average model and the online model have the same architecture but different active dropout units during training. This regularization term therefore also makes our model more robust to random dropout. Moreover, dropout is disabled during inference, which creates a discrepancy between training and inference. By making the model more robust to dropout, model consistency implicitly strengthens generalization and improves performance when dropout is disabled at inference.

Experimental results demonstrate that our model outperforms several state-of-the-art NAT models by 0.36-1.14 BLEU on the WMT14 EN↔DE and WMT16 EN↔RO datasets. Compared with the strong autoregressive Transformer (AT) baseline, our NAT model achieves competitive performance while significantly reducing inference time.


Non-autoregressive Machine Translation

Recently, we have witnessed tremendous progress in neural machine translation (NMT) Sutskever, Vinyals, and Le (2014); Bahdanau, Cho, and Bengio (2015); Vaswani et al. (2017). Given a source sentence $X$, an NMT model aims to generate the target sentence $Y = (y_1, \dots, y_T)$. Typically, the probability model of an autoregressive model is defined as $P(Y \mid X; \theta)$, where $\theta$ denotes the parameters of the network. It is formulated as a chain of conditional probabilities:

$$P(Y \mid X; \theta) = \prod_{t=1}^{T+1} P(y_t \mid y_{0:t-1}, X; \theta) \tag{1}$$

where $y_0$ and $y_{T+1}$ are [BOS] and [EOS], representing the beginning and the end of a target sentence, respectively. Note that autoregressive NMT models adopt the teacher forcing method Vaswani et al. (2017) to capture the sequential conditional dependency between target tokens. During inference, they generate the target tokens one by one in a left-to-right manner.

As the performance of NMT models has improved substantially, non-autoregressive machine translation (NAT) with parallel decoding has become a research hotspot Gu et al. (2018); the architecture is shown in Figure 1(a). We define the probability model of a NAT model as $P(Y \mid X; \theta)$. Mathematically, it is parameterized by a conditionally independent factorization:

$$P(Y \mid X; \theta) = \prod_{t=1}^{T} P(y_t \mid X; \theta) \tag{2}$$


The NAT model removes the conditional dependency between the target words and thus generates all target tokens simultaneously. Although translation becomes much faster, the lack of dependence between target words reduces translation quality. To improve the performance of NAT models, a promising line of research is iterative decoding Lee, Mansimov, and Cho (2018); Ghazvininejad et al. (2019); Ghazvininejad, Levy, and Zettlemoyer (2020); Xie et al. (2020a); Kasai et al. (2020); Guo, Xu, and Chen (2020). These methods explicitly consider the conditional dependency between target tokens over several decoding iterations, thus refining the translation results.

Conditional Masked Language Model

Since the masked language model (MLM) was proposed in BERT Devlin et al. (2019), it has achieved significant breakthroughs in natural language understanding Liu et al. (2019); Joshi et al. (2020); Song et al. (2019); Dong et al. (2019); Sun et al. (2021); Ahmad et al. (2021). However, due to the bidirectional nature of MLM, it is non-trivial to extend MLM to language generation tasks. Wang and Cho (2019) start with a sentence of all [MASK] tokens and generate words one by one in arbitrary order (instead of the standard left-to-right chain decomposition), obtaining inadequate generation quality compared with autoregressive counterparts Brown et al. (2020). More recently, XLM Lample and Conneau (2019) leverages sentence-pair translation data to train a conditional masked language model (CMLM), which improves performance on several downstream tasks, including machine translation.

Building on these works, as shown in Figure 1(b), Ghazvininejad et al. (2019) adopt the CMLM to optimize the non-autoregressive NMT model. During training, they predict the masked target tokens $Y_{mask}$ conditioned on the source sentence $X$ and the remaining observed target words $Y_{obs}$. The training objective of the CMLM-based NAT is therefore:

$$\mathcal{L}_{cmlm} = -\sum_{t=1}^{M} \log P(y_t \mid Y_{obs}, X; \theta), \quad y_t \in Y_{mask} \tag{3}$$

where $M$ denotes the number of masked tokens in the target sentence. During inference, they propose the Mask-Predict decoding strategy, which iteratively refines the generated translation given the most confident target words predicted in the previous iteration. Our work builds upon this CMLM-based NAT model and improves its performance with two proposed consistency regularization techniques.
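To make the masked-token objective concrete, here is a minimal sketch in plain Python. It assumes the model output is already available as one probability distribution per target position; the function name `masked_nll` and the toy numbers are illustrative and not taken from the paper.

```python
import math

def masked_nll(probs, target_ids, masked_positions):
    """Sum of -log p(y_t) over the masked positions only.

    probs[t] is the predicted distribution over the vocabulary at position t,
    target_ids[t] is the index of the reference token at position t.
    """
    return -sum(math.log(probs[t][target_ids[t]]) for t in masked_positions)

# Toy example: 3 target positions, vocabulary of size 3.
probs = [
    [0.8, 0.1, 0.1],    # position 0 (observed, does not contribute)
    [0.25, 0.25, 0.5],  # position 1 (masked)
    [0.5, 0.25, 0.25],  # position 2 (masked)
]
target_ids = [0, 2, 1]
loss = masked_nll(probs, target_ids, masked_positions=[1, 2])
print(loss)
```

Observed positions never enter the sum, which is what distinguishes this objective from an ordinary token-level cross-entropy over the whole sentence.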

Consistency Regularization

Consistency regularization has emerged as a gold-standard technique for semi-supervised learning Sajjadi, Javanmardi, and Tasdizen (2016); Laine and Aila (2017); Zhai et al. (2019); Oliver et al. (2018); Xie et al. (2020b); Zheng et al. (2021). One strand of this idea regularizes predictions to be invariant to small perturbations of image or language data. These semantic-preserving augmentations can be image flipping or cropping, or adversarial noise on image data Miyato et al. (2019); Carmon et al. (2019) and on natural language examples Zhu et al. (2019); Liu et al. (2020). Another strand of consistency regularization aims at penalizing sensitivity to model parameters Tarvainen and Valpola (2017); Athiwaratkun et al. (2019); Liang et al. (2021). In our work, we focus on the conditional masked language model setting and leverage both strands of consistency regularization.

Figure 2: The framework of our MvSR-NAT model, consisting of an online model and an average model. The figure depicts a training example with two different mask strategies. Both masked sentences are fed to both models. We update the parameters of the online model with gradient descent. Then we update the average model with an exponential moving average after every training step.


Model Architecture

Figure 2 illustrates the overall architecture of our proposed MvSR-NAT model. It is built upon the CMLM framework and comprises two Transformer-based modules, an Encoder and a Decoder. The Encoder follows the Transformer Vaswani et al. (2017), while the Decoder is slightly different: it replaces the left-to-right attention mask with a bidirectional attention mask, allowing it to leverage both left and right contexts to predict the target words. To focus on our main contributions, we omit the detailed architecture and refer readers to Ghazvininejad et al. (2019).
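The difference between the two decoder attention masks can be illustrated with a small sketch. It only builds boolean masks (True means "may attend") and is not tied to any particular Transformer library; the function names are hypothetical.

```python
def causal_mask(n):
    """Left-to-right mask: position i may attend only to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Full mask: every position may attend to every other position."""
    return [[True] * n for _ in range(n)]

for row in causal_mask(3):
    print(row)
for row in bidirectional_mask(3):
    print(row)
```

With the full mask, the prediction for a masked position can use observed tokens on both its left and its right, which is what the CMLM decoder relies on.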

In our work, we focus on training our model with the two proposed regularization methods. Before elaborating on the details, we first introduce some notation. Given a source sentence $X$ and a target sentence $Y$, the goal of training is to learn a probability model $P(Y \mid X; \theta)$. To construct the target input for our approach, we randomly mask the target sentence twice. We denote the subsets of masked tokens as $Y_{mask}^{1}$ and $Y_{mask}^{2}$, and the observed unmasked tokens as $Y_{obs}^{1}$ and $Y_{obs}^{2}$, respectively. As shown in Figure 2, we feed the two masked sentences to the online model (blue part in the figure). Following the CMLM formulation in Equation 3, our learning objective is to minimize the negative log-likelihood (NLL) loss of the masked tokens:

$$\mathcal{L}_{nll}^{i} = -\sum_{t=1}^{M_i} \log P(y_t \mid Y_{obs}^{i}, X; \theta), \quad y_t \in Y_{mask}^{i}, \quad i \in \{1, 2\} \tag{4}$$

where $\theta$ represents the parameters of the online model, and $M_1$ and $M_2$ represent the numbers of masked words in the two masked sentences. $P(y_t \mid Y_{obs}^{1}, X; \theta)$ and $P(y_t \mid Y_{obs}^{2}, X; \theta)$ are the distributions generated by the online model. We thus obtain the two NLL losses $\mathcal{L}_{nll}^{1}$ and $\mathcal{L}_{nll}^{2}$.

Note that the masked words in $Y_{mask}^{1}$ and $Y_{mask}^{2}$ are randomly selected and replaced by [MASK] symbols. As shown in Table 1, the tokens masked in both $Y_{mask}^{1}$ and $Y_{mask}^{2}$ are the shared masked tokens. We denote the collection of shared masked words as $Y_{share} = Y_{mask}^{1} \cap Y_{mask}^{2}$.

Consistency Regularization

In this section, we propose to improve the CMLM-based NAT with consistency regularization. Specifically, we regularize the generated predictions to be invariant to model parameters and to semantic-preserving data perturbations.

Model Consistency

We introduce the model consistency regularization method to encourage consistent predictions from an online model and an average model. The weights of the average model are maintained by an exponential moving average (EMA) (yellow part in Figure 2). Previous work has shown that averaged model weights tend to perform better than the final model weights Polyak and Juditsky (1992); Vaswani et al. (2017); Tarvainen and Valpola (2017). To exploit this advantage of the average model, we adopt the bidirectional KL divergence to encourage prediction consistency between the two models. This increases the robustness of our model to its weights and helps it learn better representations. Furthermore, as in the recent arXiv paper Liang et al. (2021), the active dropout units differ between the online model and the average model because dropout is random. The prediction consistency therefore also penalizes sensitivity to random dropout. In summary, the proposed method brings two practical advantages: first, it strengthens the robustness of our model to stochastic model weights; second, it improves generalization and reduces the discrepancy between training and inference caused by dropout.

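Formally, we define the model consistency loss as the distance between the token-level predictions produced by the online model and the average model, measured by the bidirectional KL divergence:

$$\mathcal{L}_{mc}^{i} = \sum_{t=1}^{M_i} \left[ D_{KL}\left(P_i \,\|\, P_i'\right) + D_{KL}\left(P_i' \,\|\, P_i\right) \right], \quad i \in \{1, 2\} \tag{5}$$

where $\theta'$ represents the parameters of the average model and $M_i$ is the number of masked tokens in the $i$-th masked sentence. For readability, $P_1$ and $P_1'$ abbreviate $P(y_t \mid Y_{obs}^{1}, X; \theta)$ and $P(y_t \mid Y_{obs}^{1}, X; \theta')$. They represent the predictions of the online model and the average model for the first masked sentence, respectively. $P_2$ and $P_2'$ are defined analogously for the second masked sentence.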
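The bidirectional KL divergence used as the consistency cost can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' code: distributions are lists of strictly positive probabilities, and summing over the masked positions is an assumed aggregation. The names `kl_divergence`, `bidirectional_kl` and `model_consistency` are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; q must be positive where p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def bidirectional_kl(p, q):
    """Symmetric cost KL(p || q) + KL(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)

def model_consistency(online_probs, average_probs, masked_positions):
    """Sum the bidirectional KL over masked positions (assumed aggregation)."""
    return sum(bidirectional_kl(online_probs[t], average_probs[t])
               for t in masked_positions)

# Toy predictions of the online and average models for two masked positions.
online_probs = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
average_probs = [[0.6, 0.3, 0.1], [0.5, 0.3, 0.2]]
loss = model_consistency(online_probs, average_probs, masked_positions=[0, 1])
print(loss)
```

Unlike a one-directional KL, this cost is symmetric in its two arguments, so neither model is treated as the fixed reference in the cost itself.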

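Moreover, the average model parameters $\theta'$ are obtained by the EMA method. At training step $\tau$, the updated $\theta'_{\tau}$ is computed as the EMA of the successive online weights $\theta$:

$$\theta'_{\tau} = \alpha\, \theta'_{\tau-1} + (1 - \alpha)\, \theta_{\tau} \tag{6}$$

where $\alpha$ is the moving average decay.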
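The EMA update of the average model can be sketched as follows. Weights are represented as a dictionary of scalar floats purely for illustration (a real model would hold tensors); the function name `ema_update` is hypothetical, and the decay value 0.996 is the one reported in the Model Details section.

```python
def ema_update(average, online, decay):
    """Return new average weights: decay * average + (1 - decay) * online."""
    return {name: decay * average[name] + (1.0 - decay) * online[name]
            for name in average}

online = {"w": 1.0}
average = dict(online)   # the average model starts as a copy of the online model
online["w"] = 0.0        # pretend one gradient step moved the online weight
average = ema_update(average, online, decay=0.996)
print(average["w"])
```

With a decay close to 1, the average model changes slowly, so it smooths over many recent online updates rather than tracking the latest one.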


Shared Mask Consistency

We now present our shared mask consistency regularization method. As shown in Table 1 and Figure 2, we randomly mask the same target sentence twice and forward both masked sentences to the online model and the average model. As a simple example, consider the two sentence pairs "the [MASK] is [MASK] . Diese Katze ist lustig ." and "[MASK] cat is [MASK] . Diese Katze ist lustig .". The shared [MASK] to predict can be thought of as a token-level, semantic-preserving "positive pair". We hypothesize that the representation is view-agnostic and that the semantics are shared between the different views created by random masking. Therefore, the predicted distributions of shared masked tokens are expected to be consistent. Note that we do not enforce distribution consistency at the other positions. To see why, consider the second position in the example: "cat" is observed in the second target sentence but masked in the first. If we forced the distributions at the second position to be consistent, the model would confuse "[MASK]" with "cat", leading to inferior performance.

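Mathematically, we define the shared mask consistency cost to measure the distance between the predicted distributions at the shared mask positions. Similar to model consistency, we adopt the bidirectional KL divergence:

$$\begin{aligned}
\mathcal{L}_{smc}^{oo} &= \sum_{t=1}^{M_s} \left[ D_{KL}\left(P_1 \,\|\, P_2\right) + D_{KL}\left(P_2 \,\|\, P_1\right) \right], \\
\mathcal{L}_{smc}^{oa} &= \sum_{t=1}^{M_s} \left[ D_{KL}\left(P_1 \,\|\, P_2'\right) + D_{KL}\left(P_2' \,\|\, P_1\right) \right], \\
\mathcal{L}_{smc}^{ao} &= \sum_{t=1}^{M_s} \left[ D_{KL}\left(P_1' \,\|\, P_2\right) + D_{KL}\left(P_2 \,\|\, P_1'\right) \right]
\end{aligned} \tag{7}$$

where $M_s$ represents the number of shared masked tokens between $Y_{mask}^{1}$ and $Y_{mask}^{2}$, and $P_i$ and $P_i'$ denote the predictions of the online model and the average model for the $i$-th masked sentence. $\mathcal{L}_{smc}^{oo}$, $\mathcal{L}_{smc}^{oa}$ and $\mathcal{L}_{smc}^{ao}$ represent the shared mask consistency between the two masked sentences when they are fed into the online-online, online-average, and average-online models, respectively.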
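The restriction to shared masked positions can be sketched as below. This is a toy illustration, assuming per-position probability lists with strictly positive entries and a sum over the shared positions; the function names are hypothetical. The same function covers the online-online, online-average and average-online pairings by passing the corresponding predictions.

```python
import math

def bidirectional_kl(p, q):
    """KL(p || q) + KL(q || p) for strictly positive discrete distributions."""
    forward = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    backward = sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))
    return forward + backward

def shared_mask_consistency(probs_view1, probs_view2, masked1, masked2):
    """Bidirectional KL summed over positions masked in BOTH views."""
    shared = sorted(set(masked1) & set(masked2))
    return sum(bidirectional_kl(probs_view1[t], probs_view2[t]) for t in shared)

# Three target positions, vocabulary of size 2 (toy numbers).
probs_view1 = [[0.9, 0.1], [0.5, 0.5], [0.8, 0.2]]
probs_view2 = [[0.2, 0.8], [0.5, 0.5], [0.6, 0.4]]
loss = shared_mask_consistency(probs_view1, probs_view2,
                               masked1=[0, 2], masked2=[1, 2])
print(loss)
```

Position 0 disagrees strongly between the two views but is masked in only one of them, so it contributes nothing; only position 2 enters the loss.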

Length Prediction

Autoregressive NMT models generate words one by one, and the length of the target sentence is decided by encountering the special token [EOS]. Non-autoregressive models, however, generate the target sentence in parallel and thus require a predicted length before decoding. Following Gu et al. (2018); Ghazvininejad et al. (2019), we add a special token [LEN] to the beginning of the source input. Then we predict the target length $T$ from the source sentence $X$. Mathematically, we define the loss function of this classification task as:

$$\mathcal{L}_{len} = -\sum_{l=1}^{L_{max}} \mathbb{1}(l = T)\, \log P(l \mid X; \theta) \tag{8}$$

where $L_{max}$ represents the max length of the target sentence in our corpus.
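Treating length prediction as classification can be sketched as follows. The sketch assumes the model outputs a probability for every candidate length, indexed by length; `length_loss` and `top_length_candidates` are hypothetical helper names, and keeping several of the most probable lengths mirrors the length candidates used at inference time.

```python
import math

def length_loss(length_probs, true_length):
    """Cross-entropy of the length classifier: -log P(true_length | X)."""
    return -math.log(length_probs[true_length])

def top_length_candidates(length_probs, k):
    """Indices of the k most probable lengths, most probable first."""
    ranked = sorted(range(len(length_probs)),
                    key=lambda l: length_probs[l], reverse=True)
    return ranked[:k]

# Toy distribution over lengths 0..3.
length_probs = [0.1, 0.2, 0.6, 0.1]
loss = length_loss(length_probs, true_length=2)
candidates = top_length_candidates(length_probs, k=2)
print(loss, candidates)
```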

Input: training data $\mathcal{D}$, KL loss weight $\lambda$, EMA decay $\alpha$
Output: online model parameters $\theta$ and average model parameters $\theta'$

Initialize $\theta$ and copy $\theta$ to $\theta'$;
while not converged do
    randomly sample data $(X, Y)$ from $\mathcal{D}$;
    randomly mask $Y$ twice, and obtain two masked sentences, where the masked subsets are $Y_{mask}^{1}$ and $Y_{mask}^{2}$, and the observed subsets are $Y_{obs}^{1}$ and $Y_{obs}^{2}$;
    feed the two sentence pairs to the online model, and obtain the distributions $P_1$ and $P_2$;
    feed the same examples to the average model, and obtain the distributions $P_1'$ and $P_2'$;
    calculate the negative log-likelihood losses $\mathcal{L}_{nll}^{1}$ and $\mathcal{L}_{nll}^{2}$ by Equation 4;
    calculate the model consistency losses $\mathcal{L}_{mc}^{1}$ and $\mathcal{L}_{mc}^{2}$ by Equation 5;
    calculate the shared mask consistency losses $\mathcal{L}_{smc}^{oo}$, $\mathcal{L}_{smc}^{oa}$ and $\mathcal{L}_{smc}^{ao}$ by Equation 7;
    calculate the length prediction loss $\mathcal{L}_{len}$ by Equation 8;
    update the online model parameters $\theta$ by minimizing the total loss of Equation 9;
    update the average model parameters $\theta'$ using the EMA method.
end while
Algorithm 1 Training Algorithm.

Training Algorithm

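The final training objective for MvSR-NAT is the sum of all aforementioned loss functions:

$$\mathcal{L} = \mathcal{L}_{nll}^{1} + \mathcal{L}_{nll}^{2} + \mathcal{L}_{len} + \lambda \left( \mathcal{L}_{mc}^{1} + \mathcal{L}_{mc}^{2} + \mathcal{L}_{smc}^{oo} + \mathcal{L}_{smc}^{oa} + \mathcal{L}_{smc}^{ao} \right) \tag{9}$$

where the $\mathcal{L}_{nll}$, $\mathcal{L}_{mc}$, $\mathcal{L}_{smc}$ and $\mathcal{L}_{len}$ terms are the NLL, model consistency, shared mask consistency and length prediction losses of Equations 4, 5, 7 and 8, and $\lambda$ is a hyperparameter that controls the KL losses. By training jointly with our two consistency regularization losses, we improve the robustness and generalization ability of our MvSR-NAT model to randomness (e.g., in the model weights and in the random masks of the target sentence).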

The overall training algorithm of our model is presented in Algorithm 1.
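How the individual losses combine into one training objective can be sketched in a few lines. This is a hedged illustration: the text states only that a single hyperparameter controls the KL losses, so applying one shared weight `lam` to all KL-based terms and leaving the other terms unweighted is an assumption, as are the function name `total_loss` and the toy values.

```python
def total_loss(nll_losses, length_loss, mc_losses, smc_losses, lam):
    """Sum of NLL and length losses plus lam times all KL-based losses."""
    kl_terms = sum(mc_losses) + sum(smc_losses)
    return sum(nll_losses) + length_loss + lam * kl_terms

loss = total_loss(
    nll_losses=[1.0, 2.0],        # two masked views
    length_loss=0.5,
    mc_losses=[0.1, 0.2],         # model consistency, one per view
    smc_losses=[0.3, 0.4, 0.5],   # online-online, online-average, average-online
    lam=2.0,
)
print(loss)
```

Only the online model would be updated by gradients of this objective; the average model is then refreshed by the EMA step.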


During inference, we feed a sentence consisting entirely of [MASK] tokens as the target input for the first iteration, where its length is determined by the length prediction. We then refine the translation over several iterations by masking out and re-predicting the subset of words whose probabilities are under a threshold. For more details, please refer to Ghazvininejad et al. (2019).
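The iterative refinement loop can be sketched with a toy predictor. Two things here are assumptions rather than details given in this paper: the schedule that re-masks a linearly decaying number of lowest-probability tokens (based on the mask-predict description in Ghazvininejad et al. (2019)), and the stand-in `toy_predict`, which replaces a real CMLM decoder with fixed tokens and made-up confidences.

```python
MASK = "[MASK]"

def mask_predict(predict, length, iterations):
    """Iteratively fill masked positions and re-mask the least confident ones.

    predict(tokens) must return one (token, probability) pair per position.
    """
    tokens = [MASK] * length
    probs = [0.0] * length
    for step in range(iterations):
        predictions = predict(tokens)
        # Update only positions that are currently masked.
        for i, (tok, p) in enumerate(predictions):
            if tokens[i] == MASK:
                tokens[i], probs[i] = tok, p
        # Assumed schedule: the number of re-masked tokens decays linearly.
        num_to_mask = int(length * (iterations - (step + 1)) / iterations)
        if num_to_mask == 0:
            break
        worst = sorted(range(length), key=lambda i: probs[i])[:num_to_mask]
        for i in worst:
            tokens[i] = MASK
    return tokens

REFERENCE = ["a", "b", "c", "d"]
BASE_CONFIDENCE = [0.9, 0.4, 0.7, 0.2]

def toy_predict(tokens):
    """Hypothetical predictor: confidence grows with the number of observed tokens."""
    observed = sum(tok != MASK for tok in tokens)
    return [(REFERENCE[i], min(1.0, BASE_CONFIDENCE[i] + 0.1 * observed))
            for i in range(len(tokens))]

result = mask_predict(toy_predict, length=4, iterations=2)
print(result)
```

In this toy run, the first pass fills all four positions, the two least confident ones are masked again, and the second pass re-predicts them given the two tokens that were kept.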


Experimental Setup


We conduct our experiments on five translation tasks from three public benchmarks: WMT14 EN↔DE (4.5M translation pairs), WMT16 EN↔RO (610K translation pairs), and IWSLT DE→EN (150K translation pairs). We strictly follow the dataset configurations of previous works. Specifically, we preprocess the datasets following Lee, Mansimov, and Cho (2018). Then we segment the words into subword units using the BPE method Sennrich, Haddow, and Birch (2016). For WMT14 EN↔DE, we use newstest-2013 and newstest-2014 as our development and test datasets, respectively. For WMT16 EN↔RO, we use newsdev-2016 and newstest-2016 as our development and test datasets, respectively.

Evaluation Metrics

We adopt the widely used BLEU Papineni et al. (2002) to evaluate the translation quality.

| Models | Iterations | WMT'14 EN→DE | WMT'14 DE→EN | WMT'16 EN→RO | WMT'16 RO→EN | IWSLT14 DE→EN | Speedup |
|---|---|---|---|---|---|---|---|
| Autoregressive Models | | | | | | | |
| LSTM S2S Bahdanau, Cho, and Bengio (2015) | T | 24.60 | - | - | - | - | - |
| ConvS2S Gehring et al. (2017) | T | 26.42 | - | - | - | - | - |
| Transformer Vaswani et al. (2017) | T | 28.04 | 32.69 | 34.13 | 34.46 | 32.99 | 1.00x |
| Non-Autoregressive Models | | | | | | | |
| NAT-FT Gu et al. (2018) | 1 | 19.17 | 23.20 | 29.79 | 31.44 | 24.21 | 2.36x |
| Imit-NAT Wei et al. (2019) | 1 | 24.15 | 27.28 | 31.45 | 31.81 | - | - |
| NAT-Hint Li et al. (2019) | 1 | 21.11 | 25.24 | - | - | - | - |
| Flowseq Ma et al. (2019) | 1 | 23.72 | 28.39 | 29.73 | 30.72 | - | 1.1x |
| NAT-DCRF Sun et al. (2019) | 1 | 26.07 | 29.68 | - | - | 29.99 | 9.63x |
| GLAT-NAT Qian et al. (2021) | 1 | 26.55 | 31.02 | 32.87 | 33.51 | - | 7.9x |
| NAT-IR Lee, Mansimov, and Cho (2018) | 5 | 20.26 | 23.86 | 28.86 | 29.72 | - | - |
| NAT-IR Lee, Mansimov, and Cho (2018) | 10 | 21.61 | 25.48 | 29.32 | 30.19 | 23.94 | 1.50x |
| CMLM-NAT Ghazvininejad et al. (2019) | 1 | 18.05 | 21.83 | 27.32 | 28.20 | - | 27.51x |
| CMLM-NAT Ghazvininejad et al. (2019) | 4 | 25.94 | 29.90 | 32.53 | 33.23 | 30.42 | 9.79x |
| CMLM-NAT Ghazvininejad et al. (2019) | 10 | 27.03 | 30.53 | 33.08 | 33.31 | 31.71 | 3.77x |
| MvSR-NAT (w/ kd) | 4 | 26.25 | 30.27 | 32.76 | 32.96 | - | 9.79x |
| MvSR-NAT (w/ kd) | 10 | 27.39 | 31.18 | 33.38 | 33.56 | - | 3.77x |
| MvSR-NAT (w/o kd) | 4 | 22.89 | 26.89 | 32.34 | 33.60 | 30.58 | 9.79x |
| MvSR-NAT (w/o kd) | 10 | 24.37 | 28.90 | 33.76 | 34.45 | 32.55 | 3.77x |

Table 2: The BLEU scores of our proposed MvSR-NAT model and the baseline models on the WMT14 En-De/De-En, WMT16 En-Ro/Ro-En and IWSLT14 De-En tasks. The Iterations column gives the number of decoding iterations during inference. "kd" denotes sequential-level knowledge distillation. In the Speedup column, we use seconds/sentence to measure the decoding speed, with the Transformer (beam size = 5) as the baseline.

Model Details

The online model setting strictly follows previous works. For the WMT14 EN↔DE and WMT16 EN↔RO datasets, the model setting is based on the base Transformer Vaswani et al. (2017). Specifically, we set the model dimension to 512 and the inner dimension to 2048. The Encoder and the Decoder each consist of a stack of 6 Transformer layers. For the smaller IWSLT14 DE→EN dataset, we follow the small Transformer configuration, in which the model dimension and inner dimension are 256 and 1024, respectively, and the Encoder and the Decoder each consist of a stack of 5 Transformer layers. In addition, we set the maximum target sentence length to 1000.

The average model is built upon the online model with the same architecture. Besides, its model weights are maintained by an exponential moving average method with the moving average decay set as 0.996.

We train the model with 2048 tokens per batch on eight RTX 2080 Ti GPUs. We use the Adam optimizer Kingma and Ba (2015) and a warmup learning rate schedule. During inference, we set the number of length candidates to 5 for our NAT model. For a fair comparison, we set the beam size to 5 for the baseline AT model. Moreover, we evaluate the final translation accuracy after averaging 10 checkpoints.
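Checkpoint averaging amounts to an element-wise mean over saved parameter sets. The sketch below uses dictionaries of scalar floats for illustration only; which 10 checkpoints are averaged is not specified here, and the function name `average_checkpoints` is hypothetical.

```python
def average_checkpoints(checkpoints):
    """Element-wise mean of parameter dicts that share the same keys."""
    count = len(checkpoints)
    return {name: sum(ckpt[name] for ckpt in checkpoints) / count
            for name in checkpoints[0]}

checkpoints = [{"w": 1.0, "b": 0.0}, {"w": 3.0, "b": 1.0}]
averaged = average_checkpoints(checkpoints)
print(averaged)
```

This uniform average over a few saved checkpoints differs from the EMA of the average model, which weights recent training steps more heavily and is updated after every step.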

Sequential-Level Knowledge Distillation

Previous works have demonstrated the effectiveness of sequential-level knowledge distillation for NAT models Gu et al. (2018); Lee, Mansimov, and Cho (2018); Gu, Wang, and Zhao (2019); Ghazvininejad et al. (2019); Zhou, Neubig, and Gu (2020). Following them, we train our MvSR-NAT model on the distilled corpora, which are produced by a standard left-to-right Transformer model. Since previous AT Transformers differ in performance, we adopt the one used in CMLM-NAT Ghazvininejad et al. (2019), which is our primary baseline. In Section Effect of Sequential-level Knowledge Distillation, we analyze the effect of knowledge distillation on our model.

| Model Variants | WMT'16 EN→RO (4 iterations) | WMT'16 EN→RO (10 iterations) | WMT'16 RO→EN (4 iterations) | WMT'16 RO→EN (10 iterations) |
|---|---|---|---|---|
| CMLM-NAT | 31.40 | 32.86 | 32.87 | 33.15 |
| + model consistency | 31.89 (+0.49) | 33.10 (+0.24) | 33.15 (+0.28) | 33.52 (+0.37) |
| + shared mask consistency | 32.16 (+0.76) | 33.49 (+0.63) | 33.79 (+0.92) | 34.08 (+0.93) |
| MvSR-NAT | 32.34 (+0.94) | 33.76 (+0.90) | 33.60 (+0.73) | 34.45 (+1.30) |

Table 3: Evaluation of model consistency and shared mask consistency on WMT16 EN↔RO without knowledge distillation. The number of decoding iterations is given in parentheses in the column headers.


Table 2 shows our experimental results on the three public datasets. Compared with the autoregressive models in the first part of the table, our model achieves performance comparable to the Transformer. Notably, on the small datasets WMT16 EN↔RO and IWSLT14 DE→EN, our results are only 0.01-0.44 BLEU behind.
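The reported gap can be checked against Table 2 with simple arithmetic. The sketch assumes the comparison uses the Transformer row and the MvSR-NAT row without knowledge distillation at 10 decoding iterations; the BLEU values are copied from the table.

```python
# BLEU scores copied from Table 2.
transformer = {"WMT16 EN-RO": 34.13, "WMT16 RO-EN": 34.46, "IWSLT14 DE-EN": 32.99}
mvsr_nat = {"WMT16 EN-RO": 33.76, "WMT16 RO-EN": 34.45, "IWSLT14 DE-EN": 32.55}

# Gap to the autoregressive baseline per task, rounded to two decimals.
gaps = {task: round(transformer[task] - mvsr_nat[task], 2) for task in transformer}
print(gaps)
```

The smallest and largest of these gaps correspond to the two ends of the range quoted in the text.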

In the second part of Table 2, iterative decoding methods achieve noticeable improvements over pure NAT models with one-shot decoding. The same holds for the CMLM-based NAT models. This is mainly due to the multimodality problem Gu et al. (2018): one-shot decoding hardly considers the left-to-right dependency, whereas iterative methods explicitly model the conditional dependency between target tokens over several iterations and thus obtain better performance.

Compared with our primary baseline, CMLM-NAT, our model is additionally optimized with two regularization methods without changing the CMLM architecture. It outperforms CMLM-NAT by margins of 0.36-1.14 BLEU, illustrating the effectiveness of our methods.

Effect of Sequential-level Knowledge Distillation

The comparison results for knowledge distillation are shown in Table 2. On the large dataset, i.e., WMT14 EN↔DE, our model improves with sequential-level knowledge distillation. However, the improvements from knowledge distillation do not carry over to the small dataset, i.e., WMT16 EN↔RO. We attribute this phenomenon to the complexity of the datasets Zhou, Neubig, and Gu (2020). Knowledge distillation reduces the "modes" (alternative translations for an input) in the training data, thus benefiting NAT models. We conjecture that a small dataset is likely to contain fewer redundant "modes" than a large-scale dataset. As a result, knowledge distillation is more helpful and more efficient on a large dataset than on a small one.

Figure 3: The BLEU scores with training epochs on the IWSLT14 DE-EN task.

Ablation Study

Model Consistency vs. Shared Mask Consistency

As shown in Table 3, we conduct comparative experiments on the validation set of the WMT16 EN↔RO task to illustrate the contribution of our two regularization methods. Note that the results are computed without knowledge distillation. Compared with the CMLM-NAT baseline, both model consistency and shared mask consistency improve performance, and shared mask consistency contributes the larger gain.

To further understand the two regularization methods, Figure 3 shows the BLEU score over training epochs on the IWSLT14 DE→EN task with a single decoding iteration. For a fair comparison, in every training forward pass we feed two source-target sentence pairs to each of the compared models. We can see that a) model consistency improves performance without changing the convergence trend; b) shared mask consistency slows convergence in the early training period but obtains better performance in the final training epochs. This indicates that shared mask consistency can avoid premature fitting and improve the robustness and generalization ability of our model.

Effect of Weight $\lambda$

KL Loss Weight $\lambda$ WMT'16 EN→RO WMT'16 RO→EN
33.68 34.40
33.76 34.45
33.42 34.02
33.10 33.43
32.72 32.68
Table 4: Evaluation of the KL loss weight $\lambda$.

We investigate the effect of the loss weight $\lambda$, which controls the KL-divergence losses. We conduct ablation experiments on WMT16 EN↔RO with different values of $\lambda$. The results are shown in Table 4. Small KL loss weights perform better than larger ones: in our setting, the best choice reaches 33.76 and 34.45 BLEU, and too much regularization (e.g., $\lambda = 3$) even decreases model performance.

$N$-time Shared Mask Consistency

KL Loss Weight WMT’16 ENRO WMT’16 ROEN
33.76 34.45
33.53 34.50
Table 5: Evaluation of $N$-time Shared Mask Consistency.

As in the example shown in Table 1, we forward two masked target sentences to the model and encourage their masked subset predictions to be consistent. An interesting question is whether further improvements can be achieved by forwarding three or more masked targets with different mask strategies. In this study, we denote the number of masked targets as $N$ and conduct comparative experiments on the WMT16 EN↔RO dataset. The results in Table 5 show that two masked targets are good enough for these tasks. This indicates that our two consistency regularization methods already have a strong regularization effect between two distributions, without the need to regularize more distributions.

Dropout Probability in Average Model

| Dropout Prob. | WMT'16 EN→RO | WMT'16 RO→EN |
|---|---|---|
| 0.1 | 33.17 | 33.87 |
| 0.2 | 33.36 | 34.19 |
| 0.3 | 33.76 | 34.45 |
| 0.4 | 33.39 | 34.40 |
| 0.5 | 33.10 | 34.25 |

Table 6: Evaluation of dropout probability in average model.

As mentioned in subsection Model Consistency, the model consistency method strengthens the robustness of our model to model weights and random dropout. Here, we investigate different dropout rates in the average model during training. As shown in Table 6, we test dropout values from 0.1 to 0.5 on the WMT16 EN↔RO dataset. The best dropout value for the average model is 0.3.


In this paper, building on the CMLM-based architecture, we introduce the Multi-view Subset Regularization method to improve CMLM-based NAT performance. First, we propose the shared mask consistency method, which forces the masked subset predictions to be consistent across random mask strategies. Second, we propose model consistency, which encourages the online model to generate distributions consistent with those of the average model, whose weights are maintained with an EMA. On several benchmark datasets, we demonstrate that our approach achieves considerable improvements over previous non-autoregressive models and results comparable to the autoregressive Transformer model. This work introduces a new paradigm for regularization, the multi-view subset regularization. We hope this paradigm will be helpful for the recently popular contrastive learning models.